Computational Nucleic Acid Coding and Feature Analysis
Field of the Invention 

The present invention is in the field of bioinformatics, particularly as it pertains to gene 
prediction. More specifically, the invention relates to the probabilistic analysis of nucleic acid 
sequences for the determination of coding features, including determination of state probabilities
for each nucleotide in a nucleic acid sequence, determination of coding strand, determination of 
open reading frame extent, determination of insertion and deletion location, determination of 
exon location, and determination of protein sequence. 

Background of the Invention

Advances in techniques for sequencing long stretches of genomic deoxyribonucleic acid
(DNA) have allowed investigators to collect vast amounts of nucleic acid sequence data rapidly. These
advances, combined with initiatives to sequence the entire human genome and the genomes of
several other species, have created a need for the rapid identification of genes on long stretches
of sequenced DNA. Conventional gene location techniques, such as cDNA hybridization, are
effective at locating transcribed genes, but are time-consuming and costly.

An alternative for locating genes on DNA that has not otherwise been analyzed for
potential coding regions involves using statistical detection methods. Such methods
conventionally include using probability models to predict where in a DNA sequence a gene is
located. The theoretical nucleic acid sequence probabilities can be determined through analysis
of known coding regions in the organism of interest. Once theoretical nucleic acid sequence
probabilities are determined, nucleic acid sequences in unannotated regions of DNA in the same
or a similar organism can be statistically compared to the theoretical nucleic acid sequence
probabilities. If the similarity is sufficient, the investigator is notified that a coding sequence
exists. Conventional cloning techniques can then be used to isolate the putative gene and check
for transcription.

One type of statistical detection method searches DNA by content. In such
content-based models, highly conserved regions of DNA that are common to all genes are located. If a
conserved region of DNA is found, then the nucleic acid sequence associated with the conserved
region can be compared with known genes. Such comparisons, which can be done with nucleic
acid sequence comparison programs such as BLAST, are inefficient to run, however, and
content-based searches therefore have limited desirability.

A second type of statistical detection method searches DNA by signal. This type of 
searching involves using probability models to predict whether DNA fragments within a larger 
nucleic acid sequence are coding. Early search-by-signal programs, such as TestCode and
Grail, relied on statistical variations within coding regions of DNA, including codon frequency, 
local nucleic acid sequence composition, codon preference measures, heuristics based on 
oligonucleotide frequency variations, and measures of nucleic acid sequence complexity. 

Beyond simple gene detection, there is also a need for the determination of other coding
features, such as the location of intron/exon boundaries in eukaryotic organisms and the location
of insertions or deletions. The program GENSCAN (Burge, C. and Karlin, S. (1997) Prediction
of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268, 78-94), for example,
predicts exon location with local state probabilities based on oligonucleotide usage. GENSCAN,
however, also depends on non-local nucleic acid sequence characteristics, which make the
program very sensitive to sequencing errors and genes containing alternative splicing strategies.

One statistical model that avoids the problems caused by dependence on non-local
nucleic acid sequence characteristics is the inhomogeneous Markov model. An inhomogeneous
Markov model depends upon local probabilities, and is therefore not sensitive to sequencing
errors or genes with alternative splicing strategies. The inhomogeneous Markov model is
"inhomogeneous" because it determines the state probabilities for a given nucleotide in multiple
reading frames rather than in a single reading frame. GeneMark, for example, is a computer
program that uses the inhomogeneous Markov model to locate genes.
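
By way of illustration only, the following Python sketch shows one way a frame-dependent (three-periodic) first-order Markov chain could score a sequence in each of three reading frames. It is not the GeneMark implementation. The names `TRANSITIONS`, `INITIAL`, `log_likelihood`, and `best_frame`, and all probability values, are hypothetical assumptions chosen so the example is easy to check by hand.

```python
import math

BASES = "ACGT"

def skewed_table(favored):
    """Toy transition table: every row puts 0.7 on `favored`, 0.1 on each other base."""
    return {prev: {nxt: (0.7 if nxt == favored else 0.1) for nxt in BASES}
            for prev in BASES}

def uniform_table():
    """Toy transition table with no preference (0.25 everywhere)."""
    return {prev: {nxt: 0.25 for nxt in BASES} for prev in BASES}

# Hypothetical parameters, NOT trained values: TRANSITIONS[k] is applied when
# the nucleotide being emitted sits at codon position k (0, 1 or 2).
TRANSITIONS = [skewed_table("A"), uniform_table(), skewed_table("G")]
INITIAL = {base: 0.25 for base in BASES}

def log_likelihood(seq, frame):
    """Log-probability of seq, assuming seq[0] sits at codon position `frame`."""
    logp = math.log(INITIAL[seq[0]])
    for i in range(1, len(seq)):
        codon_pos = (frame + i) % 3
        logp += math.log(TRANSITIONS[codon_pos][seq[i - 1]][seq[i]])
    return logp

def best_frame(seq):
    """Return the reading frame (0, 1 or 2) with the highest log-likelihood."""
    return max(range(3), key=lambda frame: log_likelihood(seq, frame))
```

Because each transition depends only on the preceding nucleotide and the codon position, a local error perturbs only the scores of nearby positions, which is the property the paragraph above attributes to local probabilities.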

The GeneMark gene prediction algorithm was developed in several steps. A series of 
three publications demonstrated that inhomogeneous Markov models were useful tools for gene 
prediction (see Borodovsky, M., Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986)
Statistical Patterns in Primary Structures of Functional Regions in the E. Coli Genome: I.
Oligonucleotide Frequencies Analysis, Molecular Biology, 20, 826-833, Borodovsky, M.,
Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) Statistical Patterns in Primary
Structures of Functional Regions in the E. Coli Genome: II. Non-homogeneous Markov Models,
Molecular Biology, 20, 833-840, Borodovsky, M., Sprizhitsky Yu., Golovanov E. and
Alexandrov A. (1986) Statistical Patterns in Primary Structures of Functional Regions in the E.
Coli Genome: III. Computer Recognition of Coding Regions, Molecular Biology, 20, 1145-1150,
all of which are herein incorporated by reference in their entirety). The GeneMark method was
based on an inhomogeneous Markov model and was described in 1993 (see Borodovsky, M.
and McIninch J. (1993) GeneMark, Parallel Gene Recognition for both DNA Strands, Computers
& Chemistry, 17, 123-133, and Borodovsky, M. and McIninch J. (1993) BioSystems v30, pp.
161-171, both of which are herein incorporated by reference in their entirety). The capabilities of
the GeneMark program were subsequently investigated (see James D. McIninch, Prediction of
Protein Coding Regions in Unannotated DNA sequences Using an Inhomogeneous Markov
Model of Genetic Information Encoding (1997) (Ph.D. dissertation, Georgia Institute of
Technology, on file with the Georgia Institute of Technology Library), which is herein
incorporated by reference in its entirety).

Conventional programs using inhomogeneous Markov models, however, are limited to a
defined probabilistic model for determining probability, and cannot be tailored by the
investigator to better suit the nucleic acid sequence under study if information about that nucleic
acid sequence is already available. Further, conventional implementations do not allow for the
efficient and accurate detection of other nucleic acid sequence features.

What is needed in the art is a method of determining state probabilities for a nucleic acid
sequence having some known characteristics, where the method is insensitive to frameshift
insertions or deletions, and compatible methods for detecting other nucleic acid sequence
features in known or unknown nucleic acid sequences.

Summary of the Invention

The present invention relates to the probabilistic analysis of nucleic acid sequences for 
the determination of coding features, including determination of state probabilities for each
nucleotide in a nucleic acid sequence, determination of coding strand, determination of open 
reading frame extent, determination of insertion and deletion location, determination of exon 
location, and determination of protein sequence. Described herein are methods, devices, and 
systems for analyzing the information content in nucleic acids.

The present invention includes and provides a method for determining a probability for
one or more states for a nucleotide in a nucleic acid sequence, comprising: a) determining an
initial oligonucleotide probability for each of the states for an initial oligonucleotide in the
nucleic acid sequence; b) determining transition probabilities for each of the states for
nucleotides within the nucleic acid sequence following the initial oligonucleotide; c) determining
a probability for the nucleic acid sequence for each of the states; and, d) determining a
probability for each of the states for the nucleotide based upon the probability of the nucleic acid
sequence and a bias.
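
As a non-limiting sketch, steps a) through d) might be expressed in Python as follows. The sketch assumes that the bias acts as a prior weight on each state and that the result is normalized across states; this section does not spell out that interpretation. The names `sequence_likelihood`, `state_posteriors`, and `MODELS`, the two-state setup, and every probability value are hypothetical.

```python
BASES = "ACGT"

def sequence_likelihood(seq, initial, transition, order=1):
    """Steps a)-c): probability of seq under one state's model."""
    prob = initial[seq[:order]]                        # a) initial oligonucleotide
    for i in range(order, len(seq)):
        prob *= transition[seq[i - order:i]][seq[i]]   # b) transition probabilities
    return prob                                        # c) probability of the sequence

def state_posteriors(seq, models, bias):
    """Step d): weight each state's likelihood by its bias, then normalize."""
    weighted = {state: bias[state] * sequence_likelihood(seq, initial, transition)
                for state, (initial, transition) in models.items()}
    total = sum(weighted.values())
    return {state: value / total for state, value in weighted.items()}

# Hypothetical toy models (assumed values, not trained parameters).
UNIFORM_INITIAL = {base: 0.25 for base in BASES}
MODELS = {
    "coding": (UNIFORM_INITIAL,
               {p: {n: (0.7 if n == "G" else 0.1) for n in BASES} for p in BASES}),
    "noncoding": (UNIFORM_INITIAL,
                  {p: {n: 0.25 for n in BASES} for p in BASES}),
}
```

Under these assumptions, changing the bias changes the reported state probabilities without retraining the per-state models. A practical implementation would likely accumulate logarithms rather than raw products to avoid underflow on long sequences.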

The present invention includes and provides a method for determining a probability for 
one or more states for a nucleotide in a nucleic acid sequence, comprising: a) determining an
initial oligonucleotide probability for each of the states for an initial oligonucleotide in the
nucleic acid sequence; b) determining transition probabilities for each of the states for
nucleotides within the nucleic acid sequence following the initial oligonucleotide; c) determining
a probability for the nucleic acid sequence for each of the states; and, d) determining a
probability for each of the states for the nucleotide based upon the probability of the nucleic acid
sequence, wherein the determining a probability for each of the states is capable of accepting a
bias.

The present invention includes and provides a method for determining a probability for
each of one or more states for more than one nucleotide in a nucleic acid sequence, comprising: a)
determining an initial oligonucleotide probability for each of the states for an initial
oligonucleotide in a window of a first nucleotide; b) determining transition probabilities for each
of the states for nucleotides within the window following the initial oligonucleotide; c)
determining a probability for the window for each of the states; d) determining a probability for
each of the states for the nucleotide based upon the probability for the window and a bias; and, e)
repeating steps a) through d) for each remaining nucleotide in the nucleic acid sequence.
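
The windowed repetition of step e) can be sketched as below. This is an assumption-laden illustration: the window is taken to be centered on each nucleotide and clipped at the sequence ends, and the per-window scorer `score_window` is a deliberately simplified stand-in (a composition model with hypothetical `EMIT` values) rather than the Markov model recited above.

```python
# Hypothetical emission probabilities for a stand-in composition scorer.
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def score_window(window, bias):
    """Return normalized state probabilities for one window."""
    weighted = {}
    for state, table in EMIT.items():
        prob = bias[state]
        for base in window:
            prob *= table[base]
        weighted[state] = prob
    total = sum(weighted.values())
    return {state: value / total for state, value in weighted.items()}

def per_nucleotide_posteriors(seq, half_width, bias):
    """Score the window around every nucleotide in turn (steps a-d, repeated)."""
    results = []
    for i in range(len(seq)):
        lo = max(0, i - half_width)                 # clip at the left end
        hi = min(len(seq), i + half_width + 1)      # clip at the right end
        results.append(score_window(seq[lo:hi], bias))
    return results
```

The output is one dictionary of state probabilities per nucleotide, which is the form of input assumed by the strand, open reading frame, insertion/deletion, and exon methods described in the paragraphs that follow.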

The present invention includes and provides a method for determining strand coding of a 
nucleic acid sequence based upon a bias, comprising: a) determining a probability of each of one 
or more states for each nucleotide in the nucleic acid sequence, wherein each of the states is 
either a positive strand state or a negative strand state; b) summing the probabilities of the
positive strand states for each of the nucleotides to produce a sum of probabilities for positive
states; c) summing the probabilities of the negative strand states for each of the nucleotides to
produce a sum of probabilities for negative states; and, d) deciding one of i) coding is mixed or
not detectable if a first function of the sum of probabilities for positive states and the sum of
probabilities for negative states is less than a threshold value; ii) coding is on the positive strand
if a second function of the sum of probabilities for positive states is greater than a third function
of the sum of probabilities for negative states and the first function is not less than the threshold
value; and iii) coding is on the negative strand if the second function of the sum of probabilities
for positive states is not greater than the third function of the sum of probabilities for negative
states and the first function is not less than the threshold value.
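
The three-way decision of step d) can be illustrated with the short Python sketch below. The first, second, and third functions are not defined in this section, so the defaults used here (a plain sum for the first function and identity for the second and third) are assumptions, as are the function name `decide_strand` and the convention that state names begin with "+" or "-".

```python
def decide_strand(posteriors, threshold,
                  first=lambda pos, neg: pos + neg,   # assumed "first function"
                  second=lambda pos: pos,             # assumed "second function"
                  third=lambda neg: neg):             # assumed "third function"
    """posteriors: one dict per nucleotide mapping state name -> probability.

    Hypothetical naming convention: positive strand states start with "+",
    negative strand states start with "-"; anything else is ignored.
    """
    pos_sum = sum(p for post in posteriors
                  for state, p in post.items() if state.startswith("+"))
    neg_sum = sum(p for post in posteriors
                  for state, p in post.items() if state.startswith("-"))
    if first(pos_sum, neg_sum) < threshold:
        return "mixed or not detectable"
    if second(pos_sum) > third(neg_sum):
        return "positive"
    return "negative"
```

Passing the three functions as parameters keeps the sketch neutral about their actual form; an investigator could, for example, substitute a scaled or logarithmic comparison without changing the control flow.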

The present invention includes and provides a method for determining the extent of an
open reading frame within a nucleic acid sequence based upon a bias, comprising: a) determining
the probability of each of one or more states for each nucleotide in the nucleic acid sequence,
wherein each of the states is either a coding state or a noncoding state; b) determining the coding
strand of the nucleic acid sequence; and, c) determining the points within the nucleic acid
sequence in the coding strand at which the sum of the probabilities of the coding states for each
nucleotide drops below a first threshold value for a number of nucleotides greater than a second
threshold value, wherein ends of the open reading frame are indicated at the points.
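
Step c) can be sketched as a scan for sustained dips in the per-nucleotide sum of coding-state probabilities. In the illustrative Python below, `coding_sum` is assumed to be that per-nucleotide sum already computed for the coding strand, and the function names and half-open index convention are hypothetical choices.

```python
def low_runs(coding_sum, prob_threshold, length_threshold):
    """Runs where coding_sum stays below prob_threshold for MORE than
    length_threshold consecutive nucleotides, as (start, end_exclusive)."""
    runs, start = [], None
    for i, value in enumerate(coding_sum):
        if value < prob_threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start > length_threshold:
                runs.append((start, i))
            start = None
    if start is not None and len(coding_sum) - start > length_threshold:
        runs.append((start, len(coding_sum)))
    return runs

def orf_extents(coding_sum, prob_threshold, length_threshold):
    """Regions between sustained dips, as (start, end_exclusive)."""
    extents, cursor = [], 0
    for lo, hi in low_runs(coding_sum, prob_threshold, length_threshold):
        if lo > cursor:
            extents.append((cursor, lo))
        cursor = hi
    if cursor < len(coding_sum):
        extents.append((cursor, len(coding_sum)))
    return extents
```

The second threshold keeps a brief dip, such as one caused by a short stretch of atypical composition, from splitting what would otherwise be reported as a single open reading frame.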

The present invention includes and provides a method for determining the location of
insertions and deletions within a nucleic acid sequence, comprising: a) determining the
probability of each of one or more states for each nucleotide in the nucleic acid sequence based
upon a bias, wherein each of the states is either a coding state or a noncoding state; b) setting a
length for a window; c) determining which state has a maximum mean probability for the nucleic
acid sequence on a first side of a middle nucleotide in the window, wherein the window begins at
a first nucleotide; d) determining which state has a maximum mean probability for the nucleic
acid sequence on a second side of the middle nucleotide in the window; e) determining that a
deletion or insertion occurred at the middle nucleotide if i) the state with the maximum mean
probability on the first side of the middle nucleotide is different from the state with the maximum
mean probability on the second side of the middle nucleotide, and ii) either an average of
hypothetical state probabilities for the window with an insertion at the middle nucleotide or an
average of hypothetical state probabilities for the window with a deletion at the middle
nucleotide is greater than a sum of the middle nucleotide's coding states probabilities; and, f)
repeating steps c) through e) for each remaining nucleotide in the nucleic acid sequence after the
first nucleotide, wherein the window begins at each remaining nucleotide in turn.
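
The two conditions of step e) can be sketched separately, as below. This section does not say how the hypothetical state probabilities for a window with an insertion or deletion are obtained, so the sketch simply accepts their averages as arguments. All function names are hypothetical, and the window is assumed to have an odd length of at least three so that both sides of the middle nucleotide are non-empty.

```python
def dominant_state(posteriors):
    """State with the largest mean probability over a list of per-nucleotide dicts."""
    states = list(posteriors[0].keys())
    return max(states,
               key=lambda s: sum(post[s] for post in posteriors) / len(posteriors))

def state_change_points(posteriors, window_len):
    """Condition i): middle positions whose two flanks have different dominant states."""
    if window_len < 3:
        raise ValueError("window_len must be at least 3")
    hits = []
    for start in range(len(posteriors) - window_len + 1):
        mid = start + window_len // 2
        left = posteriors[start:mid]
        right = posteriors[mid + 1:start + window_len]
        if dominant_state(left) != dominant_state(right):
            hits.append(mid)
    return hits

def passes_indel_test(mid_posterior, avg_with_insertion, avg_with_deletion,
                      coding_states):
    """Condition ii): either hypothetical average exceeds the middle
    nucleotide's summed coding-state probabilities."""
    mid_coding = sum(mid_posterior[s] for s in coding_states)
    return max(avg_with_insertion, avg_with_deletion) > mid_coding
```

A frameshift typically appears as a switch from one coding frame state to another, which is why condition i) compares the dominant state on each side of the middle nucleotide rather than the overall coding probability.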

The present invention includes and provides a method for determining exon location 
within a nucleic acid sequence, comprising: a) determining the probability of each of one or more
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the
states is either a coding state or a noncoding state; b) determining the coding strand of the nucleic
acid sequence; c) determining the extent of an open reading frame within the nucleic acid
sequence; d) classifying each nucleotide in a coding class or a noncoding class based on a most
probable state for the coding strand; e) reclassifying each nucleotide according to defined rules;
and, f) determining that regions of the nucleic acid sequence in the coding class are exons.
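
Steps d) through f) can be illustrated as follows. The "defined rules" of step e) are not given in this section, so the sketch substitutes one hypothetical rule: an interior run shorter than `min_run` is relabeled to match its neighbors. The function names and the "C"/"N" labels are likewise assumptions, and steps b) and c) are taken as already performed.

```python
from itertools import groupby

def classify(posteriors, coding_states):
    """Step d): label each nucleotide "C" or "N" by its most probable state."""
    labels = []
    for post in posteriors:
        best = max(post, key=post.get)
        labels.append("C" if best in coding_states else "N")
    return labels

def reclassify(labels, min_run):
    """Step e), with an ASSUMED rule: flip interior runs shorter than min_run."""
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    out = []
    for idx, (label, length) in enumerate(runs):
        interior = 0 < idx < len(runs) - 1
        if interior and length < min_run:
            label = "N" if label == "C" else "C"
        out.extend([label] * length)
    return out

def exon_regions(labels):
    """Step f): contiguous "C" runs as (start, end_exclusive)."""
    regions, pos = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label == "C":
            regions.append((pos, pos + length))
        pos += length
    return regions
```

With this assumed rule, a single noncoding call inside a long coding stretch is absorbed into the surrounding exon instead of splitting it in two.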

The present invention includes and provides a program storage device readable by a
machine, tangibly embodying a program of instructions executable by a machine to perform
method steps to determine a probability for each of one or more states for a nucleotide in a
nucleic acid sequence, the method steps comprising: a) determining an initial oligonucleotide
probability for each of the states for an initial oligonucleotide in the nucleic acid sequence; b)
determining transition probabilities for each of the states for nucleotides within the nucleic acid
sequence following the initial oligonucleotide; c) determining a probability for the nucleic acid
sequence for each of the states; and, d) determining a probability for each of the states for the
nucleotide based upon the probability of the nucleic acid sequence and a bias.

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine a probability for one or more states for more than one nucleotide in a 
nucleic acid sequence, the method steps comprising: a) determining an initial oligonucleotide 
probability for each of the states for an initial oligonucleotide in a window of a first nucleotide;
b) determining transition probabilities for each of the states for nucleotides within the window 
following the initial oligonucleotide; c) determining a probability for the window for each of the 
states; d) determining a probability for each of the states for the nucleotide based upon the 
probability for the window and a bias; and, e) repeating steps a) through d) for each remaining 
nucleotide in the nucleic acid sequence.

The present invention includes and provides a program storage device readable by a
machine, tangibly embodying a program of instructions executable by a machine to perform
method steps to determine strand coding of a nucleic acid sequence, the method steps
comprising: a) determining a probability of each of one or more states for each nucleotide in the
nucleic acid sequence based upon a bias, wherein each of the states is either a positive strand
state or a negative strand state; b) summing the probabilities of the positive strand states for each
of the nucleotides to produce a sum of probabilities for positive states; c) summing the
probabilities of the negative strand states for each of the nucleotides to produce a sum of
probabilities for negative states; and, d) deciding one of i) coding is mixed or not detectable if a
first function of the sum of probabilities for positive states and the sum of probabilities for
negative states is less than a threshold value; ii) coding is on the positive strand if a second
function of the sum of probabilities for positive states is greater than a third function of the sum
of probabilities for negative states and the first function is not less than the threshold value; and
iii) coding is on the negative strand if the second function of the sum of probabilities for positive
states is not greater than the third function of the sum of probabilities for negative states and the
first function is not less than the threshold value.

The present invention includes and provides a program storage device readable by a
machine, tangibly embodying a program of instructions executable by a machine to perform
method steps to determine the extent of an open reading frame within a nucleic acid sequence,
the method steps comprising: a) determining the probability of each of one or more states for
each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is
either a coding state or a noncoding state; b) determining the coding strand of the nucleic acid
sequence; and, c) determining the points within the nucleic acid sequence in the coding strand at
which the sum of the probabilities of the coding states for each nucleotide drops below a first
threshold value for a number of nucleotides greater than a second threshold value, wherein ends
of the open reading frame are indicated at the points.

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine the location of insertions and deletions within a nucleic acid sequence, 
the method steps comprising: a) determining the probability of each of one or more states for
each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is
either a coding state or a noncoding state; b) setting a length for a window; c) determining which
state has a maximum mean probability for the nucleic acid sequence on a first side of a middle
nucleotide in the window, wherein the window begins at a first nucleotide; d) determining which
state has a maximum mean probability for the nucleic acid sequence on a second side of the
middle nucleotide in the window; e) determining that a deletion or insertion occurred at the
middle nucleotide if i) the state with the maximum mean probability on the first side of the
middle nucleotide is different from the state with the maximum mean probability on the second
side of the middle nucleotide, and ii) either an average of hypothetical state probabilities for the
window with an insertion at the middle nucleotide or an average of hypothetical state
probabilities for the window with a deletion at the middle nucleotide is greater than a sum of the
middle nucleotide's coding states probabilities; and, f) repeating steps c) through e) for each
remaining nucleotide in the nucleic acid sequence after the first nucleotide, wherein the window
begins at each remaining nucleotide in turn.

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine exon location within a nucleic acid sequence, the method steps
comprising: a) determining the probability of each of one or more states for each nucleotide in
the nucleic acid sequence based upon a bias, wherein each of the states is either a coding state or
a noncoding state; b) determining the coding strand of the nucleic acid sequence; c) determining
the extent of an open reading frame within the nucleic acid sequence; d) classifying each
nucleotide in a coding class or a noncoding class based on a most probable state for the coding
strand; e) reclassifying each nucleotide according to defined rules; and, f) determining that
regions of the nucleic acid sequence in the coding class are exons.

The present invention includes and provides a computer system for determining a
probability for each of one or more states for a nucleotide in a nucleic acid sequence, comprising:
an input device for inputting the nucleic acid sequence; a memory for storing the nucleic acid
sequence; a processing unit configured for retrieving the nucleic acid sequence and for: a)
determining an initial oligonucleotide probability for each of the states for an initial
oligonucleotide in the nucleic acid sequence; b) determining transition probabilities for each of
the states for nucleotides within the nucleic acid sequence following the initial oligonucleotide;
c) determining a probability for the nucleic acid sequence for each of the states; and, d)
determining a probability for each of the states for the nucleotide based upon the probability of
the nucleic acid sequence and a bias.

The present invention includes and provides a computer system for determining a
probability for each of one or more states for more than one nucleotide in a nucleic acid
sequence, comprising: an input device for inputting the nucleic acid sequence; a memory for
storing the nucleic acid sequence; a processing unit configured for retrieving the nucleic acid
sequence and for: a) determining an initial oligonucleotide probability for each of the states for
an initial oligonucleotide in a window of a first nucleotide; b) determining transition probabilities
for each of the states for nucleotides within the window following the initial oligonucleotide; c)
determining a probability for the window for each of the states; d) determining a probability for
each of the states for the nucleotide based upon the probability for the window and a bias; and, e)
repeating steps a) through d) for each remaining nucleotide in the nucleic acid sequence.

The present invention includes and provides a computer system for determining strand
coding of a nucleic acid sequence, comprising: an input device for inputting the nucleic acid
sequence; a memory for storing the nucleic acid sequence; a processing unit configured for
retrieving the nucleic acid sequence and for: a) determining a probability of each of one or more
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the
states is either a positive strand state or a negative strand state; b) summing the probabilities of
the positive strand states for each of the nucleotides to produce a sum of probabilities for positive
states; c) summing the probabilities of the negative strand states for each of the nucleotides to
produce a sum of probabilities for negative states; and, d) deciding one of i) coding is mixed or
not detectable if a first function of the sum of probabilities for positive states and the sum of
probabilities for negative states is less than a threshold value; ii) coding is on the positive strand
if a second function of the sum of probabilities for positive states is greater than a third function
of the sum of probabilities for negative states and the first function is not less than the threshold
value; and iii) coding is on the negative strand if the second function of the sum of probabilities
for positive states is not greater than the third function of the sum of probabilities for negative
states and the first function is not less than the threshold value.

The present invention includes and provides a computer system for determining the
extent of an open reading frame within a nucleic acid sequence, comprising: an input device for
inputting a nucleic acid sequence; a memory for storing the nucleic acid sequence; a processing
unit configured for retrieving the nucleic acid sequence and for: a) determining the probability of
each of one or more states for each nucleotide in the nucleic acid sequence based upon a bias,
wherein each of the states is either a coding state or a noncoding state; b) determining the coding
strand of the nucleic acid sequence; and, c) determining the points within the nucleic acid
sequence in the coding strand at which the sum of the probabilities of the coding states for each
nucleotide drops below a first threshold value for a number of nucleotides greater than a second
threshold value, wherein ends of the open reading frame are indicated at the points.

The present invention includes and provides a computer system for determining the
location of insertions and deletions within a nucleic acid sequence, comprising: an input device
for inputting a nucleic acid sequence; a memory for storing the nucleic acid sequence; a
processing unit configured for retrieving the nucleic acid sequence and for: a) determining the
probability of each of one or more states for each nucleotide in the nucleic acid sequence based
upon a bias, wherein each of the states is either a coding state or a noncoding state; b) setting a
length for a window; c) determining which state has a maximum mean probability for the nucleic
acid sequence on a first side of a middle nucleotide in the window, wherein the window begins at
a first nucleotide; d) determining which state has a maximum mean probability for the nucleic
acid sequence on a second side of the middle nucleotide in the window; e) determining that a
deletion or insertion occurred at the middle nucleotide if i) the state with the maximum mean
probability on the first side of the middle nucleotide is different from the state with the maximum
mean probability on the second side of the middle nucleotide, and ii) either an average of
hypothetical state probabilities for the window with an insertion at the middle nucleotide or an
average of hypothetical state probabilities for the window with a deletion at the middle
nucleotide is greater than a sum of the middle nucleotide's coding states probabilities; and, f)
repeating steps c) through e) for each remaining nucleotide in the nucleic acid sequence after the
first nucleotide, wherein the window begins at each remaining nucleotide in turn.

The present invention includes and provides a computer system for determining exon 
location within a nucleic acid sequence, comprising: an input device for inputting a nucleic acid
sequence; a memory for storing the nucleic acid sequence; a processing unit configured for
retrieving the nucleic acid sequence and for: a) determining the probability of each of one or
more states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of
the states is either a coding state or a noncoding state; b) determining the coding strand of the
nucleic acid sequence; c) determining the extent of an open reading frame within the nucleic acid
sequence; d) classifying each nucleotide in a coding class or a noncoding class based on a most
probable state for the coding strand; e) reclassifying each nucleotide according to defined rules;
and, f) determining that regions of the nucleic acid sequence in the coding class are exons.

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine a probability for each of one or more states for a
nucleotide in a nucleic acid sequence, the computer program logic comprising means for
enabling the processor to perform each of the following steps: a) determining an initial
oligonucleotide probability for each of the states for an initial oligonucleotide in the nucleic acid
sequence; b) determining transition probabilities for each of the states for nucleotides within the
nucleic acid sequence following the initial oligonucleotide; c) determining a probability for the
nucleic acid sequence for each of the states; and, d) determining a probability for each of the
states for the nucleotide based upon the probability of the nucleic acid sequence and a bias.

The present invention includes and provides a computer program product comprising a 
20 computer usable medium having computer program logic recorded thereon for enabling a 

processor in a computer system to determine a probability for each of one or more states for more 
than one nucleotide in a nucleic acid sequence, the computer program logic comprising means 
for enabling the processor to perform each of the following steps: a) determining an initial 
oligonucleotide probability for each of the states for an initial oligonucleotide in a window of a 
25 first nucleotide; b) determining transition probabilities for each of the states for nucleotides 
within the window following the initial oligonucleotide; c) determining a probability for the 
window for each of the states; d) determining a probability for each of the states for the 
nucleotide based upon the probability for the window and a bias; and, e) repeating steps a) 
through d) for each remaining nucleotide in the nucleic acid sequence.

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine strand coding of a nucleic acid sequence, the
computer program logic comprising means for enabling the processor to perform each of the
following steps: a) determining a probability of each of one or more states for each nucleotide in
the nucleic acid sequence based upon a bias, wherein each of the states is either a positive strand
state or a negative strand state; b) summing the probabilities of the positive strand states for each
of the nucleotides to produce a sum of probabilities for positive states; c) summing the
probabilities of the negative strand states for each of the nucleotides to produce a sum of
probabilities for negative states; and, d) deciding one of i) coding is mixed or not detectable if a
first function of the sum of probabilities for positive states and the sum of probabilities for
negative states is less than a threshold value; ii) coding is on the positive strand if a second
function of the sum of probabilities for positive states is greater than a third function of the sum
of probabilities for negative states and the first function is not less than the threshold value; and
iii) coding is on the negative strand if the second function of the sum of probabilities for positive
states is not greater than the third function of the sum of probabilities for negative states and the
first function is not less than the threshold value.
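
Steps b) through d) of the strand determination above can be sketched as follows. This is an illustration only: the first, second, and third functions and the threshold value are not specified in this passage, so the absolute difference, the identity function, and the threshold of 0.5 used below are assumptions, as is the name `decide_strand`.

```python
from typing import Callable, Dict, List

POSITIVE_STATES = ("1+", "2+", "3+", "N+")
NEGATIVE_STATES = ("1-", "2-", "3-", "N-")


def decide_strand(
    state_probs: List[Dict[str, float]],
    first_fn: Callable[[float, float], float],
    second_fn: Callable[[float], float],
    third_fn: Callable[[float], float],
    threshold: float,
) -> str:
    """Return 'mixed', 'positive', or 'negative' following steps b) to d)."""
    # Steps b) and c): sum the positive and negative strand state
    # probabilities over every nucleotide.
    sum_pos = sum(p.get(s, 0.0) for p in state_probs for s in POSITIVE_STATES)
    sum_neg = sum(p.get(s, 0.0) for p in state_probs for s in NEGATIVE_STATES)
    # Step d) i): coding is mixed or not detectable.
    if first_fn(sum_pos, sum_neg) < threshold:
        return "mixed"
    # Step d) ii) and iii): compare the second and third functions.
    if second_fn(sum_pos) > third_fn(sum_neg):
        return "positive"
    return "negative"


# Hypothetical choices for the unspecified functions (assumptions).
def abs_difference(pos: float, neg: float) -> float:
    return abs(pos - neg)


def identity(x: float) -> float:
    return x


# Made-up state probabilities for ten nucleotides.
probs = [{"1+": 0.7, "N+": 0.1, "1-": 0.1, "N-": 0.1}] * 10
result = decide_strand(probs, abs_difference, identity, identity, threshold=0.5)
```

Passing the three functions as parameters keeps the sketch neutral about their definition; any other choice can be substituted without changing the decision logic.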

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine the extent of an open reading frame within a nucleic
acid sequence, the computer program logic comprising means for enabling the processor to
perform each of the following steps: a) determining the probability of each of one or more states
for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is
either a coding state or a noncoding state; b) determining the coding strand of the nucleic acid
sequence; and, c) determining the points within the nucleic acid sequence in the coding strand at
which the sum of the probabilities of the coding states for each nucleotide drops below a first
threshold value for a number of nucleotides greater than a second threshold value, wherein ends
of the open reading frame are indicated at the points.
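
Step c) above can be illustrated with a short sketch. It assumes that the per-nucleotide sums of coding state probabilities on the coding strand are already available as a list, and that the "point" reported is the first nucleotide of each qualifying low-probability run; both the function name `find_drop_points` and that reading of "point" are assumptions made for the example.

```python
from typing import List, Sequence


def find_drop_points(
    coding_sums: Sequence[float],
    first_threshold: float,
    second_threshold: int,
) -> List[int]:
    """Return 0-based indices where a qualifying low-probability run starts.

    A run qualifies when the coding sum stays below first_threshold for
    more than second_threshold consecutive nucleotides.
    """
    points: List[int] = []
    run_start = None
    # A sentinel equal to the threshold closes a run that reaches the end.
    for i, value in enumerate(list(coding_sums) + [first_threshold]):
        if value < first_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > second_threshold:
                points.append(run_start)
            run_start = None
    return points


# Made-up coding sums: a short dip at index 2 and a longer dip from index 4.
sums = [0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1, 0.9]
points = find_drop_points(sums, first_threshold=0.5, second_threshold=2)
```

With these inputs only the three-nucleotide dip is reported, because the single-nucleotide dip is not longer than the second threshold.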

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine the location of insertions and deletions within a
nucleic acid sequence, the computer program logic comprising means for enabling the processor
to perform each of the following steps: a) determining the probability of each of one or more
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the
states is either a coding state or a noncoding state; b) setting a length for a window; c)
determining which state has a maximum mean probability for the nucleic acid sequence on a first
side of a middle nucleotide in the window, wherein the window begins at a first nucleotide; d)
determining which state has a maximum mean probability for the nucleic acid sequence on a
second side of the middle nucleotide in the window; e) determining that a deletion or insertion
occurred at the middle nucleotide if i) the state with the maximum mean probability on the first
side of the middle nucleotide is different from the state with the maximum mean probability on
the second side of the middle nucleotide, and ii) either an average of hypothetical state probabilities
for the window with an insertion at the middle nucleotide or an average of hypothetical state
probabilities for the window with a deletion at the middle nucleotide is greater than a sum of the
middle nucleotide's coding state probabilities; and, f) repeating steps c) through e) for each
remaining nucleotide in the nucleic acid sequence after the first nucleotide, wherein the window
begins at each remaining nucleotide in turn.

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine exon location within a nucleic acid sequence, the
computer program logic comprising means for enabling the processor to perform each of the
following steps: a) determining the probability of each of one or more states for each nucleotide
in the nucleic acid sequence based upon a bias, wherein each of the states is either a coding state
or a noncoding state; b) determining the coding strand of the nucleic acid sequence; c) determining
the extent of an open reading frame within the nucleic acid sequence; d) classifying each
nucleotide in a coding class or a noncoding class based on a most probable state for the coding
strand; e) reclassifying each nucleotide according to defined rules; and, f) determining that
regions of the nucleic acid sequence in the coding class are exons.

The present invention includes and provides a method for determining a probability for
one or more states for a nucleotide in a nucleic acid sequence, comprising determining a
probability for each of the states for the nucleotide based upon a probability of the nucleic acid
sequence and a bias.

The present invention includes and provides a method for determining a probability for
each of one or more states for more than one nucleotide in a nucleic acid sequence comprising: a)
determining a probability for each of the states for a first nucleotide in the nucleic acid sequence
based upon a probability of a window in which the first nucleotide is located and a bias; and, b)
repeating step a) for the remaining nucleotides in the nucleic acid sequence.

Description Of The Figures 

Figure 1 is a flow chart representing one embodiment of a method for determining the
probability of each of the possible states for a single nucleotide in a nucleic acid sequence;

Figure 2 is a flow chart representing one embodiment of a method for determining the
probability of each of the possible states for multiple nucleotides in a nucleic acid sequence;

Figure 3 is a flow chart representing one embodiment of a method for determining the
coding strand of a nucleic acid sequence;

Figure 4 is a flow chart representing one embodiment of a method for determining the
extent of an open reading frame within a nucleic acid sequence;

Figure 5 is a flow chart representing one embodiment of a method for determining the
location of insertions and deletions within a nucleic acid sequence;

Figure 6 is a flow chart representing one embodiment of a method for determining the
extent of exons within a nucleic acid sequence and the protein translation of those exons;

Figure 7 is a flow chart representing one embodiment of a method for determining the
extent of exons within a nucleic acid sequence and the protein translation of those exons;

Figure 8a is a schematic representation of a window located at the end of a nucleic acid
sequence;

Figure 8b is a schematic representation of a window located at the end of a nucleic acid
sequence showing nucleotides near the end of the nucleic acid sequence;

Figure 8c is a schematic representation showing the ends of a nucleic acid sequence being
copied to form a hypothetical extension on each end of the nucleic acid sequence;

Figure 8d is a schematic representation of a nucleic acid sequence showing the appended
hypothetical extensions;

Figure 9a is a schematic representation of one embodiment of a computer system that can
implement the methods of the present invention;

Figure 9b is a schematic representation of one embodiment of a computer system that can
implement the methods of the present invention;

Figure 10a is a schematic representation of a genomic sequence of DNA with an
expressed sequence tag aligned thereto;

Figure 10b is a schematic representation of a window in a region of DNA when the entire
region is in a known coding region; and,

Figure 10c is a schematic representation of a window in a region of DNA when part of
the region is known to be coding, and part of the region is known to be noncoding.

Detailed Description Of The Invention 

Described herein are methods for determining the state probabilities of one or more
nucleotides in a nucleic acid sequence, the coding strand of a nucleic acid sequence, the extent of
an open reading frame in a nucleic acid sequence, the location of deletions and insertions in a
nucleic acid sequence, the location of exons in a nucleic acid sequence, and the translation of
those exons. Also described are program storage devices readable by a machine, tangibly
embodying a program of instructions executable by a machine to perform the above methods.
Also described are computer systems for implementing the above methods, comprising an input
device for inputting a nucleic acid sequence, a memory for storing the nucleic acid sequence,
and a processing unit. Also described are computer program products comprising a computer
usable medium having computer program logic recorded thereon for enabling a processor in a
computer system to perform the above methods.

Definitions: 

Nucleic Acid Sequence - As used herein, "nucleic acid sequence" includes a nucleic acid
sequence of any nucleic acid as is generally understood in the art. The nucleic acid can be DNA,
cDNA, genomic DNA, raw DNA, expressed nucleic acid sequence tags (ESTs), RNA, mRNA,
unprocessed RNA, processed RNA, or any other form of nucleic acid, regardless of whether or
not the nucleic acid actually codes for a protein.

Nucleic acid sequences can be derived from any natural or artificial source, including 
prokaryotic and eukaryotic organisms, and can be at any stage of processing. 
It is understood by those skilled in the art that any representation of a nucleic acid
sequence is contemplated herein and within the scope of the present invention. That is, while
conventionally nucleic acid sequences are represented by the nucleotide or base letters A, T, G,
C, U, any alphanumeric or other representation of nucleotide or base nucleic acid sequence,
whether digitally represented or otherwise, is within the scope of this invention. Further, nucleic
acid sequence notation indicating uncertainty with respect to the identification of one or more
bases in a nucleic acid sequence, for example IUB nomenclature such as R = G or A, Y = T or
C, etc., can be incorporated into the method described herein and is within the scope of this
invention.

Nucleic acid sequences having modified or non-standard bases can be incorporated into
the method described herein and are within the scope of this invention. For the purposes of this
invention, a nucleic acid sequence of "bases" is an equivalent nucleic acid sequence to the
nucleic acid sequence in which the bases are found.

Reading frame - A "reading frame" is one of the possible phases in which one can read a
sequence of codons (groups of three nucleotides) that can make up a coding region of DNA or
RNA. In a codon the positions in 5' to 3' order are called the "first", "second", and "third"
reading frames.

States - The "states" attributable to a nucleotide are the potential permutations of all of the
possible reading frames and the two nucleic acid strands included in the probability model being
used. A "+" is used to indicate the positive strand, and "-" to indicate the reverse complement
DNA strand. In a preferred embodiment, the possible states of any one nucleotide are positive
strand first reading frame (1+), positive strand second reading frame (2+), positive strand third
reading frame (3+), negative strand first reading frame (1-), negative strand second reading frame
(2-), negative strand third reading frame (3-), positive strand noncoding (N+), and negative
strand noncoding (N-). In another embodiment, the states can be, for example, just the four
positive states listed above. Stated symbolically, "f" is an element in the set of states, i.e. f ∈
{1+, 2+, 3+, N+, 1-, 2-, 3-, N-}.

Coding State - A "coding state" is any of the states 1+, 2+, 3+, 1-, 2-, or 3-, which indicate
coding, i.e. nucleic acids translated into protein.

Noncoding state - A "noncoding state" is either of the states N- or N+, both of which indicate
noncoding, i.e. no protein translation.

Sequentially - "Sequentially" means performing a step or series of steps on nucleotides in order
as the nucleotides occur in the nucleic acid sequence, in either direction.

State probabilities - The "state probabilities" of a nucleotide within a nucleic acid sequence are a
vector of probabilities associated with the given nucleotide being in each of the states.

Window - A "window" is a contiguous and defined number of nucleotides within a nucleic acid
sequence. For example, in a nucleic acid sequence having a length of several thousand
nucleotides, a window of, again for example, 100 nucleotides can be defined for specific analysis
at any place within the larger nucleic acid sequence.

Middle Nucleotide - The "middle nucleotide" in any given nucleic acid sequence or window is
the nucleotide found at the numerical middle of the nucleic acid sequence or window,
respectively, wherein the length of a nucleic acid sequence or window is the total number of
nucleotides in the nucleic acid sequence or window. If the nucleic acid sequence or window has
an even number of nucleotides, then the middle nucleotide can be either of the two nucleotides
adjacent to the numerical middle of the nucleic acid sequence or window. For example, the middle
nucleotide in a 101 nucleotide long window is nucleotide number 51, and the middle nucleotide
in a 100 nucleotide long window can be either nucleotide number 50 or nucleotide number 51.
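
The definition of the middle nucleotide can be expressed as a small helper. The function name `middle_nucleotides` is hypothetical; positions are numbered from 1, as in the examples given in the definition.

```python
from typing import Tuple


def middle_nucleotides(length: int) -> Tuple[int, ...]:
    """Return the 1-based position(s) that qualify as the middle nucleotide."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length % 2 == 1:
        # Odd length: a single numerical middle.
        return ((length + 1) // 2,)
    # Even length: either of the two nucleotides adjacent to the middle.
    return (length // 2, length // 2 + 1)


odd_case = middle_nucleotides(101)
even_case = middle_nucleotides(100)
```

For a window of 101 nucleotides the helper yields position 51, and for a window of 100 nucleotides it yields positions 50 and 51, matching the examples in the definition.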

Oligonucleotide - An "oligonucleotide" is a series of contiguous nucleotides with a defined
length.

Initial Oligonucleotide - The "initial oligonucleotide" is the oligonucleotide that occurs at the 
beginning of the nucleic acid sequence or window being examined. Therefore, the first 
nucleotide in the initial oligonucleotide is also the first nucleotide in the sequence or window. 

Transition Probability - A "transition probability" for a given nucleotide is the probability of the 
nucleotide occurring given the oligonucleotide immediately preceding that nucleotide. 

Bias Function - The "Bias Function" is a function that is used to differentially alter the
probability of one or more states of one or more nucleotides in a nucleic acid sequence. For
example, if a region of the nucleic acid sequence under study is thought to be a coding region,
then the bias function can be used to increase the calculated probability of the coding states for
that nucleic acid sequence.

Bias - "Bias" is a set of one or more values that are used in the Bias Function, and is used to alter 
the probability of one or more states of one or more nucleotides in a nucleic acid sequence. 

Filter - A "filter" as used herein is any method or algorithm for unifying and making more
homogeneous regions of a nucleic acid sequence that have been classified in disparate states. A 
filter is used for the purpose of more clearly defining coding region boundaries in a nucleic acid 
sequence. In a method, a step in which a filter is applied is a "filtering step." 

Class - A "class" of nucleotides is a group of nucleotides that are designated as having one state 
for the purposes of filtering. 

Positive Strand and Negative Strand - The terms "positive strand (+)" and "negative strand (-)"
represent complementary nucleic acid sequences. The sequence in one strand is defined by the
sequence in the complementary strand.

Positive Strand State - A "positive strand state" is any of states 1+, 2+, 3+, N+.

Negative Strand State - A "negative strand state" is any of states 1-, 2-, 3-, N-.

Description 

The methods described herein can be performed in any manner that allows for the
analysis of the nucleic acid sequence under study and computation of the probabilities associated
with that nucleic acid sequence. In a preferred embodiment, the physical nucleic acid sequence,
for example a DNA sequence having a contiguous nucleic acid sequence of G, C, T, and A
nucleotides, is converted into digital form by, for example, inputting the nucleic acid sequence
into a computer system. The computer then processes the nucleic acid sequence using the
methods described herein. Any nucleic acid sequence referred to herein can be arranged to have
a beginning and an end, and numbered so that the first nucleotide in the nucleic acid sequence is
number 1, the next nucleotide in the nucleic acid sequence is number 2, and so on until the end of
the nucleic acid sequence. Any other numbering scheme that is useful can be used.

The methods shown in Figures 1-7 are independent, and, although several of the methods
described can be utilized together, they can each be performed as independent methods. Further,
where one method calls for a step in which one of the other methods can be used for that step, the
use of the other method in the step represents only one embodiment, and other methods for
performing the step can be used as well.

Any probability model applicable to nucleic acid sequence state probabilities can be used
for the probability steps if the output of the probability model sufficiently supports the method,
including inhomogeneous Markov models that have fewer than eight states, for example, those
having only six or four states. In a preferred embodiment, the inhomogeneous Markov model
has eight states. (For a general discussion of various models, see Durbin et al., Biological
Sequence Analysis (1998), which is herein incorporated by reference in its entirety).

Any nucleic acid sequence source can be used, regardless of the accuracy of the nucleic
acid sequence relative to the physical molecule it represents, including raw nucleic acid sequence
data and nucleic acid sequence data that has been changed or adjusted for other purposes, such as
nucleic acid sequences that have been filtered to improve accuracy, nucleic acid sequences that
have been altered to account for known mutations, and nucleic acid sequences that have been
engineered in any manner whatsoever, among others. Nucleic acid sequence information
produced by automated nucleic acid sequencers can be used, as well as nucleic acid sequence
information derived by any conventional sequencing technique, such as dideoxy sequencing,
among others. Nucleic acid sequences produced by or from other bioinformatic processing
methods or nucleic acid databases can be used, for example, including nucleic acid sequences
stored in public access databases such as GenBank. Although nucleic acid sequences with any
amount of error can be used, in a preferred embodiment the amount of sequencing error present
is less than about 15%, and more preferably is less than about 10%. However, an advantage of
the methods of the present invention is that they can utilize lower quality nucleic acid sequences.
In this embodiment, the methods of the present invention can utilize nucleic acid sequences
where the average sequence accuracy is less than 99%, more preferably less than 95%, more
preferably less than 90%, 80%, or 70%.

The present invention includes the incorporation of bias into probability models that
determine state probabilities for one or more nucleotides. The bias is used to alter the statistical
probability of one or more states for a nucleotide. A bias of zero, for example, will reduce the
probability of a state to zero, while a bias of one will not alter the statistical probability. Values
greater than one will increase the statistical probability of a state, while values between zero and
one will reduce the statistical probability of a state. Bias can be defined by the investigator in
order to influence the probability of states. In a preferred embodiment, bias is defined to alter the
probability of states in a manner consistent with existing knowledge of the nucleic acid sequence
under study. For example, if a nucleic acid sequence has a region that is strongly suspected to be
coding, then the nucleotides in that region can be assigned a large bias for the coding states, and
a small bias for the noncoding states. Bias can be incorporated into any conventional statistical
model that provides a method for determining state probabilities in order to allow for the biasing
of statistical probabilities in that model. In one embodiment, bias can be defined for each state as
a number equal to or greater than zero, excluding one. In this embodiment, the statistical
probability of a state will be reduced if the bias is set to a number equal to or greater than zero
and less than one, and increased if the bias is set to a number greater than one, and all states are
biased in one direction or the other. In another embodiment, bias can be defined as one for one or
more states, and a number other than one for one or more states. In this embodiment, one or
more states has a defined bias of one, which results in no biasing of the probability of that state,
while one or more states have a defined value equal to or greater than zero, excluding one. In
this embodiment, one or more states are biased, and one or more states are not. In a preferred
embodiment, the bias is between 0.0 and 0.9 or greater than 1.1.

Figure 1 represents one embodiment of the method of the present invention for
determining the state probabilities of a single nucleotide within a nucleic acid sequence. The
nucleotide for which the state probabilities are determined can be any nucleotide in the nucleic
acid sequence, preferably is a nucleotide close to the middle of the sequence, and in a preferred
embodiment the nucleotide is the middle nucleotide in the nucleic acid sequence. It is preferable
to determine state probabilities for a nucleotide at or near the middle of the nucleic acid
sequence. State probabilities for the nucleotide are determined by first finding the probability of
the initial oligonucleotide in the nucleic acid sequence, and then finding the transition
probabilities for the remainder of the nucleotides in the nucleic acid sequence. The initial
oligonucleotide probability and transition probability information is used to determine the
probabilities of each of the states for the entire nucleic acid sequence, and the resulting state
probabilities are assigned to the nucleotide. Eight states are described below for Figure 1, but
those of skill in the art will readily see that fewer than eight states can be employed.
Referring now to Figure 1, in step 12, the probability that the initial oligonucleotide
occurs in each of the states is determined according to equation I:

(I)

where "a_1...a_k" is an initial oligonucleotide of length k, a_1 is the first nucleotide in the
oligonucleotide, N_f is the set of all oligonucleotides occurring in the model sample set, and f is an
element of the set of states, which, in a preferred embodiment, is {1+, 2+, 3+, N+, 1-, 2-, 3-, N-}.

The oligonucleotide length is predefined, and can be any length for which probabilities
can be reliably generated. Oligonucleotides can be, for example, from 2 to 100 nucleotides,
preferably 5 to 20 nucleotides, and more preferably from 8 to 12 nucleotides in length. The
initial oligonucleotide frequencies of all possible oligonucleotides in the model sample set can
be, for example, stored in a look-up table, which is accessed as needed. A table defining the
model sample set can be constructed, for example, by reference to sample nucleic acid sequences
from a previously examined collection of nucleic acids, preferably from a closely related
organism, more preferably from the same organism as the nucleic acid sequence under
investigation. For example, sample nucleic acid sequences from Arabidopsis can be used for a
table for investigation of nucleic acid sequences of plants such as soybean, maize, etc. Similarly,
sample nucleic acid sequences from a chimpanzee can be used for a table for investigation of
nucleic acid sequences of humans. By examining known nucleic acid sequences, model
oligonucleotide frequencies in each of the states can be determined. A table can include
indefinite or modified nucleotides, or any other nucleotide variations that occur in nucleic acid
sequences. Alternatively, it is also possible to use estimation functions in place of such a table of
probabilities (see, for example, Besemer, J., Borodovsky, M. (1999) Nucl. Acids Res., v.27, pp.
3911-3920, which is herein incorporated by reference in its entirety).
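
A look-up table of oligonucleotide frequencies of the kind described above could be built by counting oligonucleotides in sample sequences. The sketch below is a simplification made for illustration: it computes plain relative frequencies over every length-k stretch of the samples for a single state, and does not show how samples would be partitioned by state or reading frame. The name `build_oligo_table` and the sample sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def build_oligo_table(samples: Iterable[str], k: int) -> Dict[str, float]:
    """Map each length-k oligonucleotide to its relative frequency."""
    counts: Counter = Counter()
    for seq in samples:
        seq = seq.upper()
        # Slide over the sample one nucleotide at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no oligonucleotides of length k in the samples")
    return {oligo: n / total for oligo, n in counts.items()}


# Made-up sample sequences standing in for a model sample set.
table = build_oligo_table(["ATGATG", "ATGC"], k=3)
```

In practice one such table would be kept per state, and oligonucleotides absent from the samples would need some form of estimation, as the text notes.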

In step 14, the transition probabilities for all nucleotides in the nucleic acid sequence after
the initial oligonucleotide in each of the states are determined. The transition probability is the
probability of a nucleotide occurring given the oligonucleotide immediately preceding the
nucleotide. The transition probability for the first nucleotide transition is set out in equation II:

$$P_f(a_{k+1} \mid a_1 \ldots a_k) = \frac{|a_1 \ldots a_{k+1}|_f}{|a_1 \ldots a_k|_f} \qquad \text{(II)}$$



where k is the oligonucleotide length, a_1 is the first nucleotide in the oligonucleotide,
"a_1...a_k" is the initial oligonucleotide, a_{k+1} is the nucleotide immediately following a_k, and f ∈
{1+, 2+, 3+, N+, 1-, 2-, 3-, N-}. Equation II determines the transition probability for the first
nucleotide following the initial oligonucleotide. After determining the transition probability for
the first nucleotide after the initial oligonucleotide, the transition probabilities are determined
sequentially for the remaining nucleotides in the nucleic acid sequence. This means that a
transition probability is determined for the second nucleotide after the initial oligonucleotide
(a_{k+2}) based on the oligonucleotide beginning at the second position, a_2, and ending at a_{k+1}. The
process is repeated until the end of the nucleic acid sequence is reached. For example, if the
oligonucleotide length is ten, then a transition probability for nucleotide eleven is determined
based on the oligonucleotide comprising nucleotides one through ten. Then, a transition
probability for nucleotide twelve is determined based on the oligonucleotide comprising
nucleotides two through eleven, and so on, until the last nucleotide in the nucleic acid sequence
is reached.

The transition probabilities can be stored in a table, for example. The table can be
constructed, for example, by reference to sample nucleic acid sequences from a previously
examined portion of nucleic acid, preferably from a closely related organism, more preferably
from the same organism as the nucleic acid under investigation. By examining known nucleic
acid sequences, model transition probabilities in each of the states can be determined.

In step 16, the probability of the nucleic acid sequence, (S), occurring in each of the states
(f) is determined by finding the product of the probability of the initial oligonucleotide and the
transition probabilities in each of the states. This step is set forth in equation III for a model
with eight states:

$$P_f(S) = P_f(a_1 \ldots a_k) \cdot \prod_{i=0}^{\omega-k-1} P_{p(i)}(a_{k+i+1} \mid a_{i+1} \ldots a_{i+k}) \qquad \text{(III)}$$

where the function

$$p(i) = \begin{cases} i \bmod 3 + 1 & \text{if } f = 1\pm \\ (i+1) \bmod 3 + 1 & \text{if } f = 2\pm \\ (i+2) \bmod 3 + 1 & \text{if } f = 3\pm \end{cases}$$

and ω is the length of the nucleic acid sequence, and "a_1...a_k" is the initial oligonucleotide.
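
The product described for step 16 can be sketched for a single state as follows. The sketch is a simplification under stated assumptions: it uses one transition table for the whole sequence and so omits the frame-dependent selection of tables, it works in log space to avoid numerical underflow on long sequences, and it assumes every oligonucleotide encountered is present in the tables. The function name `sequence_log_probability` and the table contents are hypothetical.

```python
import math
from typing import Dict, Tuple


def sequence_log_probability(
    seq: str,
    k: int,
    initial: Dict[str, float],
    transition: Dict[Tuple[str, str], float],
) -> float:
    """Log of: P(initial oligonucleotide) times all transition probabilities."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the oligonucleotide length")
    # Probability of the initial oligonucleotide a_1...a_k.
    log_p = math.log(initial[seq[:k]])
    # One transition per remaining nucleotide: the preceding k nucleotides
    # are the context, the next nucleotide is the one being predicted.
    for i in range(len(seq) - k):
        context = seq[i:i + k]
        next_nucleotide = seq[i + k]
        log_p += math.log(transition[(context, next_nucleotide)])
    return log_p


# Made-up tables for a toy sequence with oligonucleotide length 2.
initial_table = {"AT": 0.25}
transition_table = {("AT", "G"): 0.5, ("TG", "A"): 0.4}
log_p = sequence_log_probability("ATGA", 2, initial_table, transition_table)
```

For the toy tables the result corresponds to 0.25 x 0.5 x 0.4; repeating the calculation with the tables of each state gives one sequence probability per state.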

In step 18, the probability of each state for the nucleic acid sequence "P(f|S)" is
determined given the probability of the nucleic acid sequence, S, in each state. A bias function,
φ(f), is incorporated into the equation to account for known nucleic acid sequence information.
This step is set forth in equation IV:

$$P(f \mid S) = \frac{\varphi(f) \cdot P_f \cdot P_f(S)}{\sum_{g} \varphi(g) \cdot P_g \cdot P_g(S)} \qquad \text{(IV)}$$

wherein the sum is taken over all of the states, and P_f takes one default value for each coding
state (1+, 2+, 3+, 1-, 2-, 3-) and another default value for each noncoding state (N+, N-). The
bias function is used to modify these default P_f values. By modifying the
default values, the investigator can account for known nucleic acid sequence features. For
example, if another bioinformatics process has indicated that there is a high probability that a
certain portion of a nucleic acid sequence comprises a gene, then it would be advantageous to
bias the state probabilities in favor of the coding states. The resulting state probabilities
produced by the method will reflect the bias through stronger probabilities of the coding states
relative to the noncoding states.

If, for example, the nucleic acid sequence is known to be a coding nucleic acid sequence,
the bias function can be defined by equation V:

$$\varphi(f) = \begin{cases} 1 & \text{if } f \neq N\pm \\ 0 & \text{if } f = N\pm \end{cases} \qquad \text{(V)}$$

Equation V uses a bias of 1 for all coding states, and a bias of 0 for all noncoding states.
The net effect will be to cause the probability of the sequence in each noncoding state to drop to
zero, while leaving the probability of the sequence in the coding states unaffected. Application
of equation IV then leads to a decrease of the probabilities of the noncoding states to zero, while
increasing the probabilities of the coding states.
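
The effect of a bias function on the state probabilities can be illustrated numerically. The sketch below assumes that the biased, prior-weighted sequence probabilities are normalized over all states, which is consistent with the statement that the coding state probabilities increase when the noncoding bias is zero. The uniform default values and the sequence probabilities are placeholders invented for the example, and the function name `biased_state_probabilities` is hypothetical.

```python
from typing import Dict

STATES = ("1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-")


def biased_state_probabilities(
    seq_probs: Dict[str, float],
    priors: Dict[str, float],
    bias: Dict[str, float],
) -> Dict[str, float]:
    """Weight each state by bias x default value x sequence probability,
    then normalize so that the state probabilities sum to one."""
    weights = {f: bias[f] * priors[f] * seq_probs[f] for f in STATES}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all states have zero weight")
    return {f: w / total for f, w in weights.items()}


# Bias of 1 for coding states and 0 for noncoding states, as in equation V.
coding_bias = {f: (0.0 if f.startswith("N") else 1.0) for f in STATES}
no_bias = {f: 1.0 for f in STATES}

# Placeholder inputs (assumptions, not values from the text).
uniform_priors = {f: 1.0 / len(STATES) for f in STATES}
seq_probs = {f: 0.1 for f in STATES}

unbiased = biased_state_probabilities(seq_probs, uniform_priors, no_bias)
biased = biased_state_probabilities(seq_probs, uniform_priors, coding_bias)
```

With these placeholder inputs every state has probability 1/8 without bias; with the coding bias the noncoding states fall to zero and each coding state rises to 1/6.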

If the nucleic acid sequence is known to be a noncoding nucleic acid sequence, then the 
bias function can be defined by equation VI: 



Equation VI reverses the effect of equation V. Of course, the bias function does not need 
to be binary in nature, as is shovra in the above two examples, but rather can be defined in any 
manner that corresponds with knovm nucleic acid sequence data. A principal feature of this 
technique is that it can be used to specifically combine gene prediction information from other 
sources into biasing the results of the state probabilities algorithm shown in Figure 1 (and 
subsequent gene prediction based thereon). 

The resulting values for the probability of each state for the nucleic acid sequence can 
now be associated with the nucleotide for which state probabilities were being determined. 

In a further embodiment of the method shown in Figure 1, the nucleic acid sequence is 
part of a larger nucleic acid sequence. This embodiment can be applied to any of the methods 
described herein wherein a nucleic acid sequence is used, including those represented in Figures 
1 through 7. 

Figure 1 shows the determination of state probabilities for a single nucleotide in a nucleic 
acid sequence. Oftentimes, however, it will be desirable to determine the state probabilities for 
more than one nucleotide in a nucleic acid sequence. 

Figure 2 represents the application of the method shown in Figure 1 to multiple 
nucleotides in a nucleic acid sequence. In order to determine the state probabilities for more than 
one nucleotide, a window is used for each nucleotide that is examined. The nucleotide that is 
being examined is within the window, and the probability determinations set out in equations I, 
II, III, and IV are performed for the sequence in the window. The oligonucleotide probabilities
are determined as before for the nucleic acid sequence within the window, probabilities for each
of the states are determined for the nucleic acid sequence within the window, and those
probabilities are assigned to the nucleotide within the window for which state probabilities are
being determined, which, in a preferred embodiment, is the middle nucleotide. Another
nucleotide is then examined, with the window shifted or redefined around the new nucleotide,
and so on, until the final nucleotide in the nucleic acid sequence for which state probabilities are
to be determined is reached.
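
By way of illustration only, the windowed procedure of Figure 2 can be sketched in Python as
follows. The function names, the dictionary representation of state probabilities, and the
placeholder scoring function are assumptions made for this sketch; the placeholder is not the
probability model of equations I through IV, and the window is simply clipped at the ends of the
sequence.

```python
from typing import Callable, Dict, List

StateProbs = Dict[str, float]


def per_nucleotide_state_probs(
    sequence: str,
    window_size: int,
    score_window: Callable[[str], StateProbs],
) -> List[StateProbs]:
    """Score a window around each nucleotide and assign the result to that nucleotide."""
    half = window_size // 2
    results: List[StateProbs] = []
    for i in range(len(sequence)):
        # Keep nucleotide i at (or near) the middle of its window; in this
        # sketch the window is clipped where it would run past either end.
        start = max(0, i - half)
        end = min(len(sequence), i - half + window_size)
        results.append(score_window(sequence[start:end]))
    return results


def toy_scorer(window: str) -> StateProbs:
    """Hypothetical placeholder model: GC fraction stands in for a coding probability."""
    gc = sum(base in "GC" for base in window) / len(window)
    return {"coding": gc, "noncoding": 1.0 - gc}
```

Any scoring function that returns one probability per state can be substituted for the
placeholder, so that the same loop serves whichever probability model is chosen.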

In steps 22, 24, 26, and 28, probabilities are determined as in steps 12, 14, 16, and 18
respectively, with the window in steps 22, 24, 26, and 28 corresponding to the nucleic acid
sequence in steps 12, 14, 16, and 18 respectively for the purposes of those steps. At step 28, the
state probabilities for the nucleotide for which state probabilities are being determined are
associated with that nucleotide.

In step 30, the algorithm checks to see if the state probabilities for the last nucleotide
have just been determined. If yes, flow proceeds to step 32 and ends. If in step 30 the last
nucleotide has not been reached, flow proceeds to step 34, where the next nucleotide for which
state probabilities are to be determined is designated as the nucleotide to analyze in steps 22, 24,
26, and 28. After step 34, flow returns to steps 22, 24, 26, and 28, where the state probabilities
of the designated nucleotide are determined. At step 34 any nucleotide from the remaining
nucleotides that have not yet had state probabilities determined can be designated the next
nucleotide.

In a preferred embodiment, the first nucleotide to be examined in step 22 is the first 
nucleotide in a contiguous nucleic acid sequence of nucleotides for which state probabilities are 
to be determined, each subsequent nucleotide at step 34 is the next nucleotide of the contiguous 
nucleic acid sequence of nucleotides for which state probabilities are to be determined, and the 
last nucleotide in step 30 is the last nucleotide in the contiguous nucleic acid sequence of
nucleotides for which state probabilities are to be determined. 

The window size can be the same or different for each nucleotide, and the nucleotide can 
be located anywhere within its window. In a preferred embodiment, the window size is the same 
for each nucleotide in the nucleic acid sequence, and each nucleotide is the middle nucleotide in 
its own window. In one embodiment, windows are from 3 nucleotides to 1,000 nucleotides in
length, preferably 50 to 200 nucleotides in length, and more preferably from 75 to 125
nucleotides in length. 

The result of the process shown in Figure 2 is the association of state probabilities with 
each individual nucleotide for which state probabilities were determined. In one embodiment, 
the nucleotides for which state probabilities are to be determined are a contiguous nucleic acid
sequence of nucleotides within a longer nucleic acid sequence of nucleotides. 

Figures 3 through 7 all utilize probability models to determine state probabilities. Any 
probability model that allows for determination of the required probabilities in a plurality of 
states can be used, with use of an inhomogeneous Markov model preferred, and use of the 
inhomogeneous Markov model described above in reference to Figure 2 especially preferred.

Figure 3 represents one embodiment of a method for determining the coding strand of a
nucleic acid sequence. The process determines the state probabilities for each nucleotide in the
nucleic acid sequence, sums the positive states for the nucleic acid sequence, and sums the
negative states for the nucleic acid sequence. If the sums for the positive states and the negative
states are sufficiently different, then the process determines that the strand with the greater sum is
the coding strand.

In step 38, state probabilities are determined for each nucleotide in the nucleic acid
sequence for which the coding strand is being determined. In one embodiment, state
probabilities are determined using the inhomogeneous Markov model described above in
reference to Figure 2.

In step 40, the probabilities determined in step 38 for the positive states (1+, 2+, 3+, and
N+) for each nucleotide in the nucleic acid sequence for which the coding strand is being
determined are summed. That is, the values for the states of noncoding, positive and
coding, positive in the first, second, and third reading frames for all nucleotides in the nucleic
acid sequence for which the coding strand is being determined are summed. The sum is set to
the arbitrary variable X.

In step 42, the values determined in step 38 for the negative states (1-, 2-, 3-, and N-) for each
nucleotide in the nucleic acid sequence for which the coding strand is being determined are
summed. That is, the values for the states of noncoding, negative and coding, negative in the
first, second, and third reading frames for all nucleotides in the nucleic acid sequence for which
the coding strand is being determined are summed. The sum is set to the arbitrary variable Y.
Steps 40 and 42 can be performed in reverse order.
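
By way of illustration only, the summation of steps 40 and 42, together with the threshold test
and strand comparison of steps 44 through 52, can be sketched in Python as follows. The function
names, the dictionary representation of state probabilities, and the default threshold of 0.5 are
assumptions made for this sketch. The function f is passed in as a parameter; the default shown,
the absolute difference of X and Y scaled by their sum, is likewise an assumption chosen so that
the result lies between 0 and 1, and any other discriminating function can be substituted.

```python
from typing import Callable, Dict, List, Optional

POSITIVE_STATES = ("1+", "2+", "3+", "N+")
NEGATIVE_STATES = ("1-", "2-", "3-", "N-")


def normalized_difference(x: float, y: float) -> float:
    """Assumed example of f(X, Y): |X - Y| scaled by X + Y."""
    total = x + y
    return abs(x - y) / total if total else 0.0


def coding_strand(
    state_probs: List[Dict[str, float]],
    threshold: float = 0.5,
    f: Callable[[float, float], float] = normalized_difference,
) -> Optional[str]:
    """Return "+" or "-" for the coding strand, or None if mixed or undetectable."""
    # Step 40: sum the positive-strand state probabilities over all nucleotides.
    x = sum(p.get(s, 0.0) for p in state_probs for s in POSITIVE_STATES)
    # Step 42: sum the negative-strand state probabilities over all nucleotides.
    y = sum(p.get(s, 0.0) for p in state_probs for s in NEGATIVE_STATES)
    # Steps 44 and 46: below the threshold, coding is mixed or not detectable.
    if f(x, y) < threshold:
        return None
    # Steps 48 through 52: here f(X) = X and f(Y) = Y, so the greater sum decides.
    return "+" if x > y else "-"
```

Omitting the threshold test, as in the embodiment where coding is already known to be present on
only one strand, corresponds to calling the function with a threshold of zero.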

In step 44, a function of X and Y is used to determine whether the state probabilities 
indicate sufficient coding on one strand of the nucleic acid sequence. That is, it is determined 
whether f(X,Y)<T, where T is a defined threshold value. Any function can be used that allows 
for the desired discrimination. In one embodiment, the function used in step 44 is

f(X,Y) = |X-Y|

When f(X,Y) = |X-Y|, the value of T is about 0.1 to about 0.9,
preferably is about 0.25 to about 0.75, and even more preferably is about 0.4 to about 0.6. If in
step 44 the function results in a value that is less than the threshold value, T, then flow proceeds
to step 46, where it is determined that coding is mixed or is not detectable. If in step 44 the
function results in a value that is equal to or greater than the threshold value, T, then flow
proceeds to step 48.

In step 48, it is determined on which strand coding occurs. A function of X is compared
to a function of Y to determine which strand is coding. Any two functions that allow for the
proper comparison can be used, including functions that weight one of the two strands. In one
embodiment, f(X) = X and f(Y) = Y, and the comparison in step 48 simply determines which
sum is greater. If in step 48 the function of X is found to be greater than the function of Y, then
flow proceeds to step 50 where it is determined that coding is on the positive strand. If in step 48
it is determined that the function of X is not greater than the function of Y, then flow proceeds to
step 52, where it is determined that coding is on the negative strand.

In another embodiment of the method represented by Figure 3, steps 44 and 46 can be
removed for situations in which it is already known or suspected that coding is present and only
on one strand. In this embodiment, flow begins at step 38 and, after executing step 42, flow
proceeds directly from step 42 to step 48.

Figure 4 represents one embodiment of a method for determining the extent of an open
reading frame (ORF) within a nucleic acid sequence. The process determines the extent of the
open reading frame by first determining the state probabilities for each nucleotide in the nucleic
acid sequence. Then, beginning from within the nucleic acid sequence, preferably the
approximate middle of the nucleic acid sequence, and proceeding toward one end of the nucleic
acid sequence, the process examines each nucleotide in turn and determines whether the
nucleotide is sufficiently likely to code. When a sufficient number of nucleotides with an
insufficient likelihood of coding are encountered, the process determines that one end of the open
reading frame has been found. The process then repeats from the middle to the other end of the
nucleic acid sequence in order to find the second end of the open reading frame.

In step 56, the state probabilities of each of the nucleotides in the nucleic acid sequence 
are determined. As stated above, any probability model that has the correct form of output can 
be used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov 
model described above and represented in Figure 2 most preferred. 

In step 58, the coding strand of the nucleic acid sequence is determined and designated
"S." Any algorithm or method that can use the state probabilities produced in step 56 can be
used, and in a preferred embodiment, the method described above and represented in Figure 3 is
used. If the coding strand is indeterminate, an error can be returned at this step and processing does
not continue. In applications where the coding strand is already known or suspected, step 58 can
be omitted from the process, in which case step 56 can flow directly to step 60.

In step 60 an arbitrary variable, L, is set to half of the length of the nucleic acid sequence, 
S, which designates L the middle nucleotide (determination of the middle for even and odd
sequences is done as described above for the middle nucleotide). In an alternative embodiment,
L can initially be set to any nucleotide in the nucleic acid sequence. It is preferred, however, to
begin with L relatively close to the middle of the putative ORF, because proper resolution of the
ends of the ORF is then more likely. 

Steps 62, 64, and 66 effectively search through the nucleic acid sequence in a descending 
direction from L toward the first nucleotide in the nucleic acid sequence for one of the ORF ends. 
In step 62, the sum of the probabilities of the coding states on the strand S (that is, the set
(1+, 2+, and 3+) or the set (1-, 2-, and 3-), depending on whether strand S is the positive or negative
strand) for nucleotide L is determined and compared to threshold value T'. In an alternative
embodiment, the probabilities of all six coding states (1+, 2+, 3+, 1-, 2-, and 3-) can be combined.
If the sum of the coding states is greater than or equal to a threshold value, T', and the nucleotide
is greater than the first nucleotide in the nucleic acid sequence (that is, L>1), then L is set to L-1
and P, an arbitrary counting variable, is set to L-1. In one embodiment, the value of T' is about
0.1 to about 0.9, preferably is about 0.25 to about 0.75, and even more preferably is about 0.4 to
about 0.6. 

Flow then proceeds to step 64. If the sum of the coding states, as discussed above, is less 
than T' and P is greater than 1, then P is set to P-1. The effect of the two steps, 62 and 64, is to
reduce both L and P at the same rate if the sum of the coding states is greater than or equal to T',
or to reduce P but not L if the sum of the states is less than T'. 

After step 64, flow proceeds to step 66, where it is determined if L-P>T" or P=1. If
L-P>T", wherein T" is a threshold value, then a gap between the last nucleotide (L) with a sufficient
sum of coding states and the current nucleotide being examined has increased beyond the
threshold value T". T" can be set to any number that allows for the proper gap of noncoding
nucleotides. T" should be larger than the maximum expected length of an intron for the nucleic
acid sequence. This number will depend in large part on the model sample set being used. If the
number for T" is set too low, then a relatively lengthy intron will be sufficient to fix L at the end
of an exon that is not at the end of the ORF. If P=1, then the end of the sequence has been
reached. In one embodiment, T" is about 10 to about 20,000 nucleotides, preferably about 50 to
about 10,000 nucleotides, and more preferably about 500 to about 700 nucleotides.

If neither condition in step 66 is met, then flow returns to step 62 and loops through steps
64 and 66 until one of the conditions in step 66 is met, at which point flow proceeds to step 68.
Steps 68, 70, 72, and 74 check for the end of the ORF in the ascending direction, and perform the
same function as steps 60, 62, 64, and 66 but in the opposite direction.

In step 68, M is set to the middle nucleotide. As above for L, this value can be altered in 
alternative embodiments. In step 70, the sum of the coding states, as above, is compared to T',
and M is compared to the length of the nucleic acid sequence. If the sum of the coding states of
nucleotide M is greater than or equal to T' and M is less than the length of the nucleic acid
sequence, then M is set to M+1 and Q is set to M+1. Flow proceeds to step 72, where, if the sum
of the coding states is less than T' and Q is less than the length of the nucleic acid sequence, then
Q is set to Q+1. Flow proceeds to step 74, where it is determined if Q-M>T", or Q> length of the
nucleic acid sequence. If either is true, then flow proceeds to step 76, where the ORF is
determined to extend from nucleotide L to nucleotide M. If in step 74 neither condition is true,
then flow loops to step 70.

In an alternative embodiment, different threshold values can be used in place of T' and T"
for the second loop, which comprises steps 70, 72, and 74. Different threshold values for steps
62, 64, and 66 versus steps 70, 72, and 74 could be desirable if, for example, one end of an ORF
was known or suspected to be degraded to some extent.
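
By way of illustration only, the two searches of Figure 4 can be sketched in Python as follows.
This is a simplified reading rather than a line-by-line transcription of steps 60 through 76: the
counter pairs L, P and M, Q are replaced by a single scan that remembers the last position with a
sufficient coding sum and stops once the gap since that position exceeds the gap threshold.
Positions are 0-based, and the function names and default thresholds are assumptions made for
this sketch.

```python
from typing import Dict, List, Tuple


def coding_sum(probs: Dict[str, float], strand: str) -> float:
    """Sum the three coding-frame probabilities on the given strand ("+" or "-")."""
    return sum(probs.get(f"{frame}{strand}", 0.0) for frame in "123")


def scan_end(state_probs: List[Dict[str, float]], strand: str, start: int,
             step: int, min_coding: float, max_gap: int) -> int:
    """Walk from start in direction step (+1 or -1) and return the last coding position."""
    last_coding = start
    pos = start
    while 0 <= pos < len(state_probs):
        if coding_sum(state_probs[pos], strand) >= min_coding:
            last_coding = pos  # plays the role of L (or M)
        elif abs(pos - last_coding) > max_gap:
            break  # the noncoding gap exceeds the gap threshold
        pos += step
    return last_coding


def orf_extent(state_probs: List[Dict[str, float]], strand: str,
               min_coding: float = 0.5, max_gap: int = 600) -> Tuple[int, int]:
    """Return (first, last) positions of the putative ORF, scanning out from the middle."""
    middle = len(state_probs) // 2
    first = scan_end(state_probs, strand, middle, -1, min_coding, max_gap)
    last = scan_end(state_probs, strand, middle, +1, min_coding, max_gap)
    return first, last
```

With a gap threshold smaller than an intervening noncoding stretch, the scan stops at the edge of
that stretch, which mirrors the caution above that too low a value fixes the end at an internal
exon boundary.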

Figure 5 is a flowchart representing one embodiment of a method for determining the
location of deletions and additions within a nucleic acid sequence. The process first determines
the state probabilities for each nucleotide in the nucleic acid sequence. Then the process
determines whether in the window around a specific nucleotide the most likely state for the
nucleic acid sequence on one side of the specific nucleotide is different from the most likely state
for the nucleic acid sequence on the other side of the specific nucleotide. If so, the process
determines whether a hypothetical insertion or deletion at the specific nucleotide would
sufficiently improve the state probabilities of the entire nucleic acid sequence in the window. If
so, then an insertion or a deletion is indicated.

In step 78, the state probabilities of each of the nucleotides in the nucleic acid sequence are
determined. As stated above, any probability model that has the correct form of output can be
used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov model
described above and represented in Figure 2 most preferred.

In step 80, the first nucleotide is designated as "Z," and the size of a window, W, is set.
In step 82, the probabilities of each of the states of the nucleotides between Z and the midpoint of
the window, Z+W/2, are averaged, and the state with the greatest average is set to "A" (windows
with an even or odd number of nucleotides are treated as above for the middle nucleotide with
respect to determination of W/2). "A" is effectively the most likely state of the first half of
window W.

In step 84, the probabilities of the states of the nucleotides between the midpoint of the
window, Z+W/2, and the end of the window, Z+W, are averaged, and the state with the greatest
average is set to B. B is effectively the most likely state of the second half of window W.

In step 86, the most probable states, A and B, are checked to see if they are each a coding
state and not the same coding state. If both A and B are coding states and they are not the same
coding state, then flow proceeds to steps 88, 90, and 92, where the nucleotide at Z+W/2 is
examined further. If, in step 86, A and B are the same coding state, or if one of the two is most
probably a noncoding state, then flow proceeds to step 96, where it is determined if Z is greater than
the length of the nucleic acid sequence minus W/2. If so, then flow proceeds to step 98, and the
process ends. If, in step 96, Z is not within a distance of W/2 of the end of the nucleic acid
sequence, then flow proceeds to step 100, where Z is increased by one. Flow then loops to step
82.
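
By way of illustration only, the comparison of window halves in steps 82, 84, and 86, repeated
as Z is advanced in step 100, can be sketched in Python as follows. The sketch only flags the
positions Z+W/2 that would be passed on to steps 88 through 92; it does not evaluate hypothetical
insertions or deletions. Positions are 0-based, the scan stops when the window no longer fits
inside the sequence (a simplification of step 96), and the function names are assumptions made
for this sketch.

```python
from typing import Dict, List

CODING_STATES = {"1+", "2+", "3+", "1-", "2-", "3-"}


def most_likely_state(state_probs: List[Dict[str, float]]) -> str:
    """Return the state with the greatest average probability over the given nucleotides."""
    totals: Dict[str, float] = {}
    for probs in state_probs:
        for state, p in probs.items():
            totals[state] = totals.get(state, 0.0) + p
    # Every total is divided by the same count, so the greatest sum is also
    # the greatest average.
    return max(totals, key=totals.get)


def frame_shift_candidates(state_probs: List[Dict[str, float]], window: int) -> List[int]:
    """Return positions Z + W/2 where the two window halves favor different coding states."""
    half = window // 2
    candidates = []
    for z in range(len(state_probs) - window + 1):
        a = most_likely_state(state_probs[z:z + half])           # step 82
        b = most_likely_state(state_probs[z + half:z + window])  # step 84
        if a in CODING_STATES and b in CODING_STATES and a != b:  # step 86
            candidates.append(z + half)
    return candidates
```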

If in step 86 it was determined that both conditions were met, then flow proceeds to
steps 88 through 92 to determine if either a deletion or an addition occurred at nucleotide Z+W/2.

In step 88, a hypothetical average of state probabilities for state A for the entire window,
nucleotides Z to Z+W, for an insertion is determined. The hypothetical average of state
probabilities for state A is determined for the window as if the nucleotide at Z+W/2 is removed.
The probabilities of state A of the nucleotides in W are averaged to obtain the hypothetical
average state probabilities for state A for the entire window, and the value is set to N. In step 90,
a hypothetical average of state probabilities for state A for the entire window, nucleotides Z to
Z+W, for a deletion is calculated similarly. The hypothetical average of state probabilities for
state A in step 90 is determined and set to M for the window as if a nucleotide has been added on
one side or the other of the nucleotide at Z+W/2. By averaging the state probabilities of all of the
nucleotides in the window for either an insertion or a deletion, the values of N and M reflect the
likelihood that either an insertion or a deletion has taken place. In steps 88 and 90, in an
alternative embodiment, state B can be used in place of state A to achieve a similar result.

In step 92, the larger of M and N is compared to the sum of the probabilities of the states
indicating coding (1+, 2+, 3+, 1-, 2-, and 3-) of the nucleotide at Z+W/2. If in step 92 neither M
nor N is greater than the sum of the probabilities of the coding states of the nucleotide at Z+W/2,
then it is determined that no insertion or deletion has taken place and flow proceeds to step 96. If
in step 92 either M or N is greater than the sum of the probabilities of the coding states of the
nucleotide at Z+W/2, then it is determined that an insertion or a deletion has taken place, and flow
proceeds to step 94.

In step 94, a deletion is indicated if N is greater than M, and an insertion is indicated if N
is not greater than M, and flow then proceeds to step 96.

Figure 6 is a flow chart representing one embodiment of a method for determining the 
location of one or more exons within a nucleic acid sequence and the protein translation of those 
exons. The process begins by determining the state probabilities for each nucleotide in the
nucleic acid sequence, the coding strand, and the extent of the open reading frame. The process
then classifies each nucleotide according to its most probable state. Filters, which reclassify
nucleotides in a defined manner in order to make local blocks of the nucleic acid sequence
consistent, are then applied to the nucleic acid sequence. Regions of the nucleic acid sequence
that are in any of classes 1, 2, or 3 are then designated as exons, and the exons are translated.
Translation is accomplished by using the universal genetic code to convert the nucleic acid
sequence of the designated exons into the corresponding amino acid sequence based on the
reading frame of the class. That is, exons in class 1 will be translated in reading frame 1, exons
in class 2 will be translated in reading frame 2, and exons in class 3 will be translated in
reading frame 3. The translation is linearly arranged to correspond to the linear arrangement of
the exons along the nucleic acid sequence.

In step 102, the state probabilities of each of the nucleotides in the nucleic acid sequence 
are determined. As stated above, any probability model that has the correct form of output can be 
used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov model 
described above and represented in Figure 2 most preferred. In step 104, the strand and the 
extent of the open reading frame are determined. Any method for determining the strand and the
extent of the ORF that can use the state probabilities generated in step 102 can be used, and in a 
preferred embodiment, the methods described above and represented in Figures 3 and 4 can be 
used for such determination. 







In step 106, the nucleotides in the nucleic acid sequence are categorized as the highest 
probability state as determined in step 102. For example, in a model having four states for each 
nucleic acid strand, each nucleotide is categorized as 1, 2, 3, or N. 
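
By way of illustration only, the categorization of step 106 can be sketched in Python as follows.
The function name and the dictionary representation of state probabilities are assumptions made
for this sketch; the strand sign is dropped from the winning state so that each nucleotide is
labeled 1, 2, 3, or N.

```python
from typing import Dict, List


def classify(state_probs: List[Dict[str, float]]) -> List[str]:
    """Label each nucleotide with its highest-probability state, without the strand sign."""
    classes = []
    for probs in state_probs:
        best = max(probs, key=probs.get)    # e.g. "2+" or "N-"
        classes.append(best.rstrip("+-"))   # e.g. "2" or "N"
    return classes
```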

In step 108, which is optional, one or more filters are applied to the nucleic acid sequence 
in order to group adjacent nucleotides by class. Any filter that converts portions of the nucleic
acid sequence with inconsistent nucleotide classification to a more homogeneous state can be
used. The net effect of the application of one or more filters to the nucleic acid sequence
classification in step 106 will be to group adjacent nucleotides and blocks of nucleotides into the
same coding classification, thereby making exons and introns more uniform, and exon and intron
boundaries more evident.

In step 110, the filtered nucleic acid sequence is analyzed for exons. Any contiguous
regions with coding classes of 1, 2, or 3 are determined to be exons. Once each exon has been
identified, the exons can be translated using the universal genetic code, and a resulting protein
sequence derived.

Figure 7 is a second embodiment of the method described above and represented in
Figure 6, with explicit filtering steps detailed therein. In Figure 7, steps 102, 104, 106, and 110
are the same as those described above and shown in Figure 6. In Figure 7, after step 106, steps
112, 114, 116, 118, 120, 122, and 124 are filter steps that are applied to the categorized nucleic
acid sequence produced in step 106. The order shown for the filter steps, 112, 114, 116, 118,
120, 122, and 124, can be rearranged to occur in any order in the process, and any combination
of the steps can be used, including combinations that omit one or more of the filtering steps.

In step 112, any noncoding nucleotide flanked by two nucleotides with the same class is
reclassified into the class of the two flanking nucleotides. For example, 1,N,1 would be
converted to 1,1,1.

In step 114, any nucleotide that is flanked by two pairs of adjacent nucleotides all with
the same class is reclassified into the class of the flanking nucleotides. For example, 1,1,2,1,1
would be converted to 1,1,1,1,1.

In step 116, any adjacent nucleotide pair having the same class that is flanked by two
pairs of adjacent nucleotides all with the same class is reclassified into the class of the flanking
nucleotides. For example, 1,1,2,2,1,1 would be converted to 1,1,1,1,1,1.

In step 118, any adjacent nucleotide pair having the same class that is flanked by two
nucleotides with the same class is reclassified into the class of the flanking nucleotides. For
example, 1,2,2,1 would be converted to 1,1,1,1.

In step 120, any nucleotide flanked by two nucleotides with the same class is reclassified
into the class of the flanking nucleotides. For example, 1,2,1 is converted to 1,1,1.
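
By way of illustration only, the filters of steps 112 through 120 can be sketched in Python as a
single function parameterized by the length of the run being reclassified and the number of
flanking nucleotides required on each side. The function name and its parameters are assumptions
made for this sketch. Reclassification is applied in place as the scan proceeds, so a change made
at one position can influence positions examined later in the same pass.

```python
from typing import List


def smooth(classes: List[str], run: int, flank: int,
           only_noncoding: bool = False) -> List[str]:
    """Reclassify a uniform run flanked on both sides by `flank` nucleotides of one class."""
    out = list(classes)
    for i in range(flank, len(out) - run - flank + 1):
        left = out[i - flank:i]
        middle = out[i:i + run]
        right = out[i + run:i + run + flank]
        target = left[0]
        flanks_agree = all(c == target for c in left + right)
        run_uniform = all(c == middle[0] for c in middle)
        if flanks_agree and run_uniform and middle[0] != target:
            if only_noncoding and middle[0] != "N":
                continue  # step 112 touches noncoding nucleotides only
            out[i:i + run] = [target] * run
    return out


# Steps expressed as (run, flank) settings:
#   step 112: smooth(c, 1, 1, only_noncoding=True)   1,N,1       -> 1,1,1
#   step 114: smooth(c, 1, 2)                         1,1,2,1,1   -> 1,1,1,1,1
#   step 116: smooth(c, 2, 2)                         1,1,2,2,1,1 -> 1,1,1,1,1,1
#   step 118: smooth(c, 2, 1)                         1,2,2,1     -> 1,1,1,1
#   step 120: smooth(c, 1, 1)                         1,2,1       -> 1,1,1
```

Because each filter is an independent call, the filters can be applied in any order or omitted,
consistent with the embodiments described above.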

In step 122, any contiguous, noncoding nucleotide region with an insufficient length is
reclassified into the class of the flanking coding regions. An insufficient length is any length that
is too small to be an intron. This length will be dependent in large part upon the particular
nucleic acid sequence under study. In one embodiment, a length of about 10 to 50, preferably
about 20 to 40, and more preferably about 25 to 35 nucleotides is used. The size of the
noncoding nucleotide length required can, in alternative embodiments, be changed as appropriate
to better suit examination of the nucleic acid sequence under study. In step 122, the
classification of the flanking regions of coding nucleotides can be extended into the noncoding
regions an equal amount on either side, an unequal amount on either side, or entirely on one side
or the other.

In step 124, any coding region (i.e., a region with nucleotides of classes 1, 2, or 3,
comprising more than one nucleotide classification) is reclassified as the most common class in
that coding segment.
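
By way of illustration only, steps 122 and 124 can be sketched in Python as follows. The function
names and the default minimum intron length of 30 are assumptions made for this sketch. Of the
options described for step 122, the sketch extends the class of the left flanking region across
the whole short noncoding region; the other options would be implemented analogously.

```python
from collections import Counter
from itertools import groupby
from typing import List


def fill_short_noncoding(classes: List[str], min_intron: int = 30) -> List[str]:
    """Step 122: give interior noncoding runs shorter than min_intron the left flank's class."""
    out = list(classes)
    pos = 0
    for label, group in groupby(classes):
        length = len(list(group))
        end = pos + length
        interior = pos > 0 and end < len(classes)  # flanked by coding on both sides
        if label == "N" and interior and length < min_intron:
            out[pos:end] = [classes[pos - 1]] * length
        pos = end
    return out


def unify_coding_regions(classes: List[str]) -> List[str]:
    """Step 124: give each contiguous coding region its most common class."""
    out: List[str] = []
    for is_noncoding, group in groupby(classes, key=lambda c: c == "N"):
        region = list(group)
        if is_noncoding:
            out.extend(region)
        else:
            majority = Counter(region).most_common(1)[0][0]
            out.extend([majority] * len(region))
    return out
```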

Flow proceeds to step 110, where the filtered nucleic acid sequence is analyzed for exons. 
Any contiguous regions with nucleotides of classes 1, 2, or 3 are determined to be exons. Once
each exon has been identified, the exons can be translated using the universal genetic code, and a 
resulting protein sequence derived. 
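
By way of illustration only, the exon identification and translation of step 110 can be sketched
in Python as follows. The function names are assumptions made for this sketch. The sketch also
assumes that an exon of class k is read in reading frame k counted from the first nucleotide of
that exon, that adjacent regions of different classes are reported as separate exons, and that
the sequence is given in upper-case DNA letters; codons containing any other character are
rendered as "X". The table is the standard genetic code.

```python
from itertools import groupby
from typing import Dict, List, Tuple

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_exons(classes: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, class) for each run of class 1, 2, or 3; end is exclusive."""
    exons = []
    pos = 0
    for label, group in groupby(classes):
        length = len(list(group))
        if label in ("1", "2", "3"):
            exons.append((pos, pos + length, label))
        pos += length
    return exons


def translate(dna: str, frame: int = 1) -> str:
    """Translate from a 1-based reading frame, dropping any trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translate_exons(sequence: str, classes: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for each exon, in order along the sequence."""
    return [
        (start, end, translate(sequence[start:end], int(label)))
        for start, end, label in find_exons(classes)
    ]
```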

While performing the methods described above in Figures 1-7, windows can sometimes 
extend past the end of a sequence. Conventional applications that use window-based probability 
models for multiple nucleotides, such as the windows described above, are limited in their
application at the ends of nucleic acid sequences. Since coding probability can be calculated
using a window that is centered on each nucleotide of a nucleic acid sequence in turn, a window
can extend beyond an end of a sequence. Figure 8a schematically represents a nucleic acid
sequence 200 with a window 204 of length "W." As shown in Figure 8a, the window 204 is
empty for the first W/2 bases at an end 206 of the sequence 200.

As shown in Figure 8b, the present invention remedies this problem by using the local
nucleic acid sequence 216 at the end 206 of the nucleic acid sequence 200 as a source for
hypothetical nucleotides added on to the end 206 of the nucleic acid sequence 200. As shown in
Figure 8c, a copy 218 of the local nucleic acid sequence 216 can be created. As shown in Figure
8d, the copy 218 can then be appended onto the end 206 to form a hypothetical nucleic acid
sequence extension. As shown in Figure 8d, the window 204 is now filled with nucleotides from
the nucleic acid sequence 200 and the hypothetical nucleic acid sequence extension 218, which
allows for probability determination within the window 204. As shown in Figures 8b, 8c, and
8d, the same process can be performed on the other end of the sequence at the same time. Any
number of nucleotides can be copied and added in this manner in order to provide the correct size
window. In a preferred embodiment, the number of nucleotides copied is a multiple of three.
For example, if a 100 nucleotide window is desired for the first nucleotide in the nucleic acid
sequence, the first 51 nucleotides of the nucleic acid sequence can be copied to form a
hypothetical 51 nucleotide extension. When state probabilities are determined for the first
nucleotide, the 51 appended nucleotides are used to fill the first half of the window. The same or
different nucleotides can be copied and used in a similar manner for any other nucleotides
without a sufficient window. This process can be repeated for the other end of the nucleic acid
sequence, of course, as needed. The copied nucleotides can be appended in either orientation on
the end of the nucleic acid sequence.
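
By way of illustration only, the extension of Figures 8b through 8d can be sketched in Python as
follows. The function name and return convention are assumptions made for this sketch. The number
of nucleotides copied is half the window size rounded up to a multiple of three, which gives 51
for a 100 nucleotide window as in the example above; the copies are appended in their original
orientation, and the pad is limited to the sequence length for very short sequences.

```python
from typing import Tuple


def extend_ends(sequence: str, window: int) -> Tuple[str, int]:
    """Pad both ends with copies of the local sequence; return (extended, pad length)."""
    half = window // 2
    pad = -(-half // 3) * 3          # round half the window up to a multiple of three
    pad = min(pad, len(sequence))    # cannot copy more than the sequence holds
    left = sequence[:pad]                     # copy of the local sequence at the first end
    right = sequence[-pad:] if pad else ""    # copy of the local sequence at the other end
    return left + sequence + right, pad
```

The returned pad length is the offset of the first original nucleotide within the extended
sequence, so that state probabilities computed on the extended sequence can be mapped back to
the original positions.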

Implementation:

A computer system capable of carrying out the functionality and methods described 
above is shown in more detail in Figure 9a. A computer system 702 includes one or more 
processors, such as a processor 704. The processor 704 is connected to a communication bus
706. The computer system 702 also includes a main memory 708, which is preferably random
access memory (RAM). Various software embodiments are described in terms of this exemplary
computer system. After reading this description, it will become apparent to a person skilled in
the relevant art how to implement the invention using other computer systems and/or computer
architectures. 

In a further embodiment, shown in Figure 9b, the computer system can also include a 
secondary memory 710. The secondary memory 710 can include, for example, a hard disk drive
712 and/or a removable storage drive 714, representing a floppy disk drive, a magnetic tape
drive, or an optical disk drive, among others. The removable storage drive 714 reads from and/or
writes to a removable storage unit 718 in a well known manner. The removable storage unit 718
represents, for example, a floppy disk, magnetic tape, or an optical disk, which is read by and
written to by the removable storage drive 714. As will be appreciated, the removable storage
unit 718 includes a computer usable storage medium having stored therein computer software
and/or data.

In alternative embodiments, the secondary memory 710 may include other similar means
for allowing computer programs or other instructions to be loaded into the computer system.
Such means can include, for example, a removable storage unit 722 and an interface 720.
Examples of such can include a program cartridge and cartridge interface (such as that found in
video game devices), a removable memory chip (such as an EPROM, or PROM) and associated
socket, and other removable storage units 722 and interfaces 720 which allow software and data
to be transferred from the removable storage unit 722 to the computer system.

The computer system can also include a communications interface 724. The 

communications interface 724 allows software and data to be transferred between the computer
system and external devices. Examples of the communications interface 724 can include a
modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot
and card, etc. Software and data transferred via the communications interface 724 are in the form
of signals 726 that can be electronic, electromagnetic, optical or other signals capable of being
received by the communications interface 724. Signals 726 are provided to the communications
interface 724 via a channel 728. The channel 728 carries signals 726 in two directions and can be
implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and
other communications channels. In one embodiment, the channel is a connection to a network.
The network can be any network known in the art, including, but not limited to, LANs, WANs,
and the Internet. Nucleic acid sequence data can be stored in remote systems, databases, or
distributed databases, among others, for example GenBank, and transferred to the computer system
for processing via the network. In a preferred embodiment, nucleic acid sequence data is
received through the Internet via the channel 728. Nucleic acid sequences can be input into the
system and stored in the main memory 708. Input devices include the communication and
storage devices described herein, as well as keyboards, voice input, and other devices for
transferring data to a computer system. In a further embodiment, nucleic acid sequences can be
generated by an automatic sequencer, for example any that are known in the art, and the
implementations described herein can be incorporated within the automatic sequencer device in
order to directly use the output of the automatic sequencer.

In this document, the terms "computer program medium" and "computer usable medium"
are used to generally refer to media such as the removable storage device 718, a hard disk
installed in hard disk drive 712, and signals 726. These computer program products are means
for providing software to the computer system.

Computer programs (also called computer control logic) are stored in the main memory 
708 and/or the secondary memory 710. Computer programs can also be received via the
communications interface 724. Such computer programs, when executed, enable the computer
system to perform the features of the present invention as discussed herein. In particular, the
computer programs, when executed, enable the processor 704 to perform the features of the
present invention. Accordingly, such computer programs represent controllers of the computer
system.

In an embodiment where the invention is implemented using software, the software may
be stored in a computer program product and loaded into the computer system using the
removable storage drive 714, the hard drive 712 or the communications interface 724. The
control logic (software), when executed by the processor 704, causes the processor 704 to
perform the functions of the invention as described herein.

In another embodiment, the invention is implemented primarily in hardware using, for
example, hardware components such as application specific integrated circuits (ASICs). In one
embodiment incorporating ASIC technology, a self-contained device, which could be hand-held,
has integrated circuits specific to performing the methods described above without the need for
software. Implementation of such a hardware state machine so as to perform the functions
described herein will be apparent to persons skilled in the relevant art(s). In yet another
embodiment, the invention is implemented using a combination of both hardware and software.

The following examples are illustrative only. It is not intended that the present invention 
be limited to the illustrative embodiments. 


EXAMPLE 1 

Referring now to Figures 10a, 10b, and 10c, examples of biasing are shown. Figure 10a
shows a portion of genomic DNA 300. Aligned with the genomic DNA 300 is an expressed
sequence tag (EST) 302. The EST 302 comprises coding regions 304 and noncoding regions
306. In Figure 10b a window 308 of nucleotides is examined. The window 308 is positioned on
the genomic DNA 300 that corresponds to a known coding region 304 on the EST 302. The a
priori probability of coding is said to be 100% over that window 308 and a bias is applied
accordingly. In Figure 10c, a different window 310 straddles the intron-exon boundary, and the
a priori probability of coding is said to be 100% for the nucleotides in the window 310 that
correspond to the coding region 304 of the EST 302, while the a priori probability of coding is
said to be 0% for the nucleotides in the window 310 that correspond to the noncoding region 306
of the EST 302.

Bias is applied to the two different situations shown in Figures 10b and 10c as follows.
The general equation for the probability of the sequence S = a_1...a_w of a Markov process of order
n is shown in Equation VII:



This equation is based on an inhomogeneous Markov model, whereby the initial and
transitional probabilities are dependent on the periodic state of the sequence (as in a hidden
Markov model with fixed state transition probabilities). In this model, initial and transition
probabilities are dependent on the sequence orientation and phase in which the sequence is read
relative to the codons in the coding portion of the nucleic acid sequence. Thus, equation VIII is
used:
$$P_f(S) = P_f(a_1 \ldots a_n) \cdot \prod_{i=1}^{w-n} P_{F(f,i)}(a_{n+i} \mid a_i \ldots a_{n+i-1}) \tag{VIII}$$

where, given a state f ∈ {1+, 2+, 3+, N+, 1-, 2-, 3-, N-} representing the possible states
for reading the sequence,

$$F(f,i) = \begin{cases} i \bmod 3 + 1 & \text{if } f = 1\pm \\ (i+1) \bmod 3 + 1 & \text{if } f = 2\pm \\ (i+2) \bmod 3 + 1 & \text{if } f = 3\pm \\ N & \text{if } f = N\pm \end{cases} \tag{IX}$$



Equation X is used to apply Bayes' rule to determine the probability that the sequence S is
in state σ:

$$P(\sigma \mid S) = \frac{P_\sigma \cdot P_\sigma(S)}{\sum_i P_i \cdot P_i(S)} \tag{X}$$



A bias function φ(σ) is added to equation X in order to allow for biasing of regions of DNA for
which coding information is available. The bias function is incorporated in equation XI:

$$P(\sigma \mid S) = \frac{\phi(\sigma) \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,1-,2-,3-,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} \tag{XI}$$
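The structure of equation XI can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `biased_posterior`, the uniform priors, and the likelihood values are hypothetical and are not taken from the specification. The bias used in the demonstration is 1 for the three plus-strand coding states and 0 elsewhere.

```python
STATES = ("1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-")


def biased_posterior(prior, likelihood, bias):
    """Return P(state | S) for every state, following the form of equation XI.

    prior[s]      -- a priori probability P_s of state s
    likelihood[s] -- P_s(S), probability of the sequence read in state s
    bias[s]       -- bias weight phi(s); all ones reproduces equation X
    """
    weights = {s: bias[s] * prior[s] * likelihood[s] for s in STATES}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("the bias excludes every state with non-zero weight")
    return {s: w / total for s, w in weights.items()}


# Hypothetical inputs, chosen only to exercise the function.
prior = {s: 1.0 / len(STATES) for s in STATES}
likelihood = {"1+": 4e-6, "2+": 1e-6, "3+": 2e-6, "N+": 3e-6,
              "1-": 6e-6, "2-": 1e-6, "3-": 1e-6, "N-": 3e-6}

unbiased = biased_posterior(prior, likelihood, {s: 1.0 for s in STATES})
plus_only = {s: (1.0 if s in ("1+", "2+", "3+") else 0.0) for s in STATES}
biased = biased_posterior(prior, likelihood, plus_only)

for s in STATES:
    print(f"{s:>2}  unbiased={unbiased[s]:.4f}  biased={biased[s]:.4f}")
```

With a zero/one bias of this kind, the excluded states drop to zero while the ratios among the retained states are unchanged, because the same prior and likelihood terms appear in both numerator and denominator.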
Equation XI can be applied to the hypothetical region of DNA shown in the window 308
in Figure 10b. Since the entirety of the sequence in the window 308 lies in a coding region (as
determined with the EST 302), a bias function φ(σ) can be defined according to equation XII:

$$\phi(\sigma) = \begin{cases} 1 & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 0 & \text{otherwise} \end{cases} \tag{XII}$$

which reflects that we know with 100% certainty that the sequence segment must be
coding in one of the three direct reading frames, but that we do not know which. In this case,
since φ(σ) = 0 where σ ∈ {N+, 1-, 2-, 3-, N-}, with equation XII, equation XI can be written as
equation XIII:

$$P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-, N+, N-\} \\[4pt] \dfrac{P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+\}} P_i \cdot P_i(S)} & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \tag{XIII}$$



Because P_{1+} = P_{2+} = P_{3+} (since the EST does not indicate any difference in probability
among the three reading frames), equation XIII can be simplified for σ ∈ {1+, 2+, 3+} as shown
in equation XIV:

$$P(\sigma \mid S) = \frac{P_\sigma(S)}{\sum_{i \in \{1+,2+,3+\}} P_i(S)} \tag{XIV}$$



The function φ(σ) results in a coding potential (equation XIV) substantially different from
the unbiased coding potential function (shown by equation X). In this example, the chosen bias
function reduces the probability of the evaluated window 308 to zero in all but the three
plus-strand coding states. This effectively forces the window to be evaluated as coding in one of the
positive coding states, while not biasing the probability of those states relative to each other (e.g.,
the ratio of the probabilities of two plus-strand coding states is the same with or without the bias
function, whereas the ratio of a plus-strand coding state probability to that of any other state may
differ).

Figure 10c illustrates a window 310 wherein the evaluated sequence straddles an
exon-intron boundary as indicated by the EST 302. A possible function φ(σ) for this situation would
be to expand equation XII to equation XIII:

$$\phi(\sigma) = \begin{cases} e & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 1 - e & \text{if } \sigma \in \{N+, N-\} \\ 0 & \text{if } \sigma \in \{1-, 2-, 3-\} \end{cases} \tag{XIII}$$



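where e represents the fraction of bases in the part of the sequence in the window that lies
in the coding region of the DNA 300 as indicated by the coding region 304 of the EST 302. If
equation XIII is put into equation XI, equation XIV results:

$$P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-\} \\[4pt] \dfrac{(1-e) \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} & \text{if } \sigma \in \{N+, N-\} \\[8pt] \dfrac{e \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \tag{XIV}$$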



where P_σ = 1/4 for σ ∈ {N+, N-} and 1/6 for σ ∈ {1+, 2+, 3+} (given the assumption that
coding and noncoding are equiprobable events, each coding state is equiprobable with any other
coding state, and that both noncoding states are equiprobable, 1/4 × 2 = 1/2 and 1/6 × 3 = 1/2).
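The boundary-straddling case can be illustrated with a short Python sketch. It uses the priors stated above (1/4 for each noncoding state, 1/6 for each plus-strand coding state) and the three-way bias in e; the function names, the window size, and the likelihood values are hypothetical and serve only as an example.

```python
# Priors as stated in the text; the minus-strand coding states receive a bias
# of zero, so their priors never enter the calculation.
PRIOR = {"1+": 1 / 6, "2+": 1 / 6, "3+": 1 / 6, "N+": 1 / 4, "N-": 1 / 4}
MINUS_CODING = ("1-", "2-", "3-")


def straddle_bias(e):
    """Bias for a window in which a fraction e of the bases is coding."""
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must be a fraction between 0 and 1")
    bias = {s: e for s in ("1+", "2+", "3+")}
    bias.update({s: 1.0 - e for s in ("N+", "N-")})
    bias.update({s: 0.0 for s in MINUS_CODING})
    return bias


def straddle_posterior(e, likelihood):
    """Posterior over all eight states for a boundary-straddling window."""
    bias = straddle_bias(e)
    weights = {s: bias[s] * PRIOR[s] * likelihood[s] for s in PRIOR}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("all weighted likelihoods are zero")
    posterior = {s: w / total for s, w in weights.items()}
    posterior.update({s: 0.0 for s in MINUS_CODING})
    return posterior


# Hypothetical window: 9 of 12 bases fall in the exon, so e = 0.75.
likelihood = {"1+": 4e-6, "2+": 1e-6, "3+": 2e-6, "N+": 3e-6, "N-": 3e-6}
result = straddle_posterior(9 / 12, likelihood)
for state, p in result.items():
    print(f"{state:>2}: {p:.4f}")
```

At e = 1 the sketch reduces to the fully coding window of Figure 10b, and at e = 0 only the two noncoding states retain any probability.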
EXAMPLE 2

The following example illustrates the computations involved in probability calculations
for a sequence with and without a bias applied. The nucleotide sequence GATGACATT is used
in this example for clarity and simplicity, but it is understood that longer sequences as indicated
above can be used. Further, for this example, a zero order inhomogeneous Markov model is
used. In this model, the initial probabilities are all 1 and each event is independent of that which
precedes it (a_1...a_k → a_{k+1} becomes N → a_{k+1} because k is zero). Models of higher order can be used,
as described above.

Accordingly, the following hypothetical table of probabilities is used:

              Direct (+)              Reverse (-)
         1+      2+      3+      1-      2-      3-      N±
    T   0.13    0.27    0.13    0.10    0.25    0.21    0.20
    C   0.28    0.26    0.39    0.39    0.21    0.38    0.30
    A   0.21    0.26    0.09    0.13    0.27    0.13    0.21
    G   0.38    0.21    0.39    0.38    0.26    0.28    0.29



Without a bias function φ(σ) to incorporate known information in the calculations, P(S|σ)
can be calculated for the zero order case for the sequence GATGACATT according to equations
XV through XXI.
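$$\begin{aligned}
P(\text{GATGACATT} \mid 1+) &= P(N) \cdot P_{1+}(G \mid N) \cdot P_{2+}(A \mid N) \cdot P_{3+}(T \mid N) \cdot \\
&\quad\; P_{1+}(G \mid N) \cdot P_{2+}(A \mid N) \cdot P_{3+}(C \mid N) \cdot \\
&\quad\; P_{1+}(A \mid N) \cdot P_{2+}(T \mid N) \cdot P_{3+}(T \mid N) \\
&= P_{1+}(G) \cdot P_{2+}(A) \cdot P_{3+}(T) \cdot P_{1+}(G) \cdot P_{2+}(A) \cdot P_{3+}(C) \cdot P_{1+}(A) \cdot P_{2+}(T) \cdot P_{3+}(T) \\
&= 0.38 \times 0.26 \times 0.13 \times 0.38 \times 0.26 \times 0.39 \times 0.21 \times 0.27 \times 0.13 \\
&= 3.6479448 \times 10^{-6}
\end{aligned} \tag{XV}$$

$$\begin{aligned}
P(\text{GATGACATT} \mid 2+) &= P_{2+}(G) \cdot P_{3+}(A) \cdot P_{1+}(T) \cdot P_{2+}(G) \cdot P_{3+}(A) \cdot P_{1+}(C) \cdot P_{2+}(A) \cdot P_{3+}(T) \cdot P_{1+}(T) \\
&= 0.21 \times 0.09 \times 0.13 \times 0.21 \times 0.09 \times 0.28 \times 0.26 \times 0.13 \times 0.13 \\
&= 5.7133273 \times 10^{-8}
\end{aligned} \tag{XVI}$$

$$\begin{aligned}
P(\text{GATGACATT} \mid 3+) &= P_{3+}(G) \cdot P_{1+}(A) \cdot P_{2+}(T) \cdot P_{3+}(G) \cdot P_{1+}(A) \cdot P_{2+}(C) \cdot P_{3+}(A) \cdot P_{1+}(T) \cdot P_{2+}(T) \\
&= 0.39 \times 0.21 \times 0.27 \times 0.39 \times 0.21 \times 0.26 \times 0.09 \times 0.13 \times 0.27 \\
&= 1.4874917 \times 10^{-6}
\end{aligned} \tag{XVII}$$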
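$$\begin{aligned}
P(\text{GATGACATT} \mid 1-) &= P_{1-}(G) \cdot P_{2-}(A) \cdot P_{3-}(T) \cdot P_{1-}(G) \cdot P_{2-}(A) \cdot P_{3-}(C) \cdot P_{1-}(A) \cdot P_{2-}(T) \cdot P_{3-}(T) \\
&= 0.38 \times 0.27 \times 0.21 \times 0.38 \times 0.27 \times 0.38 \times 0.13 \times 0.25 \times 0.21 \\
&= 5.7332419 \times 10^{-6}
\end{aligned} \tag{XVIII}$$

$$\begin{aligned}
P(\text{GATGACATT} \mid 2-) &= P_{2-}(G) \cdot P_{3-}(A) \cdot P_{1-}(T) \cdot P_{2-}(G) \cdot P_{3-}(A) \cdot P_{1-}(C) \cdot P_{2-}(A) \cdot P_{3-}(T) \cdot P_{1-}(T) \\
&= 0.26 \times 0.13 \times 0.10 \times 0.26 \times 0.13 \times 0.39 \times 0.27 \times 0.21 \times 0.10 \\
&= 2.5262776 \times 10^{-7}
\end{aligned} \tag{XIX}$$

$$\begin{aligned}
P(\text{GATGACATT} \mid 3-) &= P_{3-}(G) \cdot P_{1-}(A) \cdot P_{2-}(T) \cdot P_{3-}(G) \cdot P_{1-}(A) \cdot P_{2-}(C) \cdot P_{3-}(A) \cdot P_{1-}(T) \cdot P_{2-}(T) \\
&= 0.28 \times 0.13 \times 0.25 \times 0.28 \times 0.13 \times 0.21 \times 0.13 \times 0.10 \times 0.25 \\
&= 2.2607130 \times 10^{-7}
\end{aligned} \tag{XX}$$

$$\begin{aligned}
P(\text{GATGACATT} \mid N) &= P_{N}(G) \cdot P_{N}(A) \cdot P_{N}(T) \cdot P_{N}(G) \cdot P_{N}(A) \cdot P_{N}(C) \cdot P_{N}(A) \cdot P_{N}(T) \cdot P_{N}(T) \\
&= 0.29 \times 0.21 \times 0.20 \times 0.29 \times 0.21 \times 0.30 \times 0.21 \times 0.20 \times 0.20 \\
&= 1.8692402 \times 10^{-6}
\end{aligned} \tag{XXI}$$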
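The products in equations XV through XXI can be reproduced with a short Python sketch. The emission table is the hypothetical table given in this example; the names `EMISSION`, `column_for`, and `likelihood` are illustrative only. The sketch assumes, as the equations do, that the codon phase advances 1, 2, 3, 1, ... along the sequence on both strands.

```python
import math

# Zero order emission probabilities from the hypothetical table, keyed by
# column (codon phase and strand, or N for noncoding) and then by base.
EMISSION = {
    "1+": {"T": 0.13, "C": 0.28, "A": 0.21, "G": 0.38},
    "2+": {"T": 0.27, "C": 0.26, "A": 0.26, "G": 0.21},
    "3+": {"T": 0.13, "C": 0.39, "A": 0.09, "G": 0.39},
    "1-": {"T": 0.10, "C": 0.39, "A": 0.13, "G": 0.38},
    "2-": {"T": 0.25, "C": 0.21, "A": 0.27, "G": 0.26},
    "3-": {"T": 0.21, "C": 0.38, "A": 0.13, "G": 0.28},
    "N": {"T": 0.20, "C": 0.30, "A": 0.21, "G": 0.29},
}


def column_for(state, offset):
    """Table column used for the base at 0-based offset when reading in state."""
    if state == "N":
        return "N"
    phase, strand = int(state[0]), state[1]
    return f"{(phase - 1 + offset) % 3 + 1}{strand}"


def likelihood(sequence, state):
    """P(S | state) under the zero order inhomogeneous model."""
    return math.prod(
        EMISSION[column_for(state, k)][base] for k, base in enumerate(sequence)
    )


SEQUENCE = "GATGACATT"
likelihoods = {state: likelihood(SEQUENCE, state) for state in EMISSION}
for state, value in likelihoods.items():
    print(f"P(S|{state}) = {value:.7e}")
```

Small differences in the last printed digits relative to the values above are possible, since the printed values were presumably rounded by hand or by a different program.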



Given the values of P(S|σ), we can determine the probability that the given sequence
segment is in state σ, P(σ|S), using equation XXII (Bayes' rule):

$$P(\sigma \mid S) = \frac{P(\sigma) \cdot P(S \mid \sigma)}{\sum_i [P(i) \cdot P(S \mid i)]} \tag{XXII}$$

Equations XXIII through XXIX show the calculations for each of the states.
$$P(1+ \mid S) = \frac{P(1+) \cdot P(S \mid 1+)}{\sum_i [P(i) \cdot P(S \mid i)]} = \frac{3.0399540 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.27484131 \tag{XXIII}$$

$$P(2+ \mid S) = \frac{4.7611061 \times 10^{-9}}{1.1060761 \times 10^{-6}} = 0.004304501 \tag{XXIV}$$

$$P(3+ \mid S) = \frac{1.2395764 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.1120697 \tag{XXV}$$



$$P(1- \mid S) = \frac{4.7777018 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.43195053 \tag{XXVI}$$

$$P(2- \mid S) = \frac{2.1052313 \times 10^{-8}}{1.1060761 \times 10^{-6}} = 0.019033331 \tag{XXVII}$$

$$P(3- \mid S) = \frac{1.8839275 \times 10^{-8}}{1.1060761 \times 10^{-6}} = 0.017032531 \tag{XXVIII}$$

$$P(N \mid S) = \frac{1.557002 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.14076807 \tag{XXIX}$$



The coding probability function indicates a 43% probability that the sequence is coding in
the first reading frame of the reverse-complement strand (-) of the sequence provided, based on
the zero order inhomogeneous Markov model used. While this is the most probable state, it is also true
that there is a greater probability (57%) that the sequence is not in that state.
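The normalization step can be checked with a minimal Python sketch. It takes the seven likelihoods from equations XV through XXI as given and assumes equal priors for all states, which appears consistent with the worked numerators (each is one twelfth of the corresponding likelihood); with equal priors the prior cancels out of Bayes' rule. The names `LIKELIHOOD` and `posterior` are illustrative.

```python
# Likelihoods P(S | state) for GATGACATT, as listed in equations XV to XXI.
LIKELIHOOD = {
    "1+": 3.6479448e-6,
    "2+": 5.7133273e-8,
    "3+": 1.4874917e-6,
    "1-": 5.7332419e-6,
    "2-": 2.5262776e-7,
    "3-": 2.2607130e-7,
    "N": 1.8692402e-6,
}


def posterior(likelihood, prior=None):
    """Apply Bayes' rule; equal priors are assumed when none are supplied."""
    if prior is None:
        prior = {state: 1.0 / len(likelihood) for state in likelihood}
    weights = {state: prior[state] * likelihood[state] for state in likelihood}
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


posteriors = posterior(LIKELIHOOD)
best = max(posteriors, key=posteriors.get)
for state, p in posteriors.items():
    print(f"P({state}|S) = {p:.4f}")
print("most probable state:", best)
```

The sketch selects state 1- at roughly 43%, in line with the conclusion above; later digits may differ slightly from the printed values.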
An investigator can apply the bias function method to impose a bias based on
prior knowledge of sequence features, such as an EST alignment to the subject sequence, or
homology to a previously characterized sequence. For example, given an EST alignment to the
subject sequence that implies the sequence is coding on the positive strand, a bias function can be
defined that summarizes that observation. Equation XXX is one example of such a function:

$$\phi(\sigma) = \begin{cases} 0.95 & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 0.05 & \text{if } \sigma \notin \{1+, 2+, 3+\} \end{cases} \tag{XXX}$$
This bias function does not exclude the possibility that the sequence is noncoding or
coding on the reverse complement strand, although it does effectively bias the a priori
probability that the sequence is coding in one of the forward three reading frames. The function
above states that the three forward coding states are 19-fold (0.95/0.05) more probable than the
other states, which is an assertion by the investigator of confidence that the EST alignment
is correct in indicating that the sequence is coding on that strand.

Given the bias function defined above, the values for P'(S|σ) are determined as before for
the unbiased case. To calculate P'(σ|S), however, equation XXXI is used:
$$P'(\sigma \mid S) = \frac{\phi(\sigma) \cdot P(\sigma) \cdot P(S \mid \sigma)}{\sum_i [\phi(i) \cdot P(i) \cdot P(S \mid i)]} \tag{XXXI}$$

The equations to determine P'(σ|S) for each state are shown in equations XXXII through
XXXVIII:
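$$\begin{aligned}
P'(1+ \mid S) &= \frac{\phi(1+) \cdot P(1+) \cdot P(S \mid 1+)}{\sum_i [\phi(i) \cdot P(i) \cdot P(S \mid i)]} \\
&= \frac{0.95 \cdot \tfrac{1}{12}(3.6479448 \times 10^{-6})}{0.95 \cdot \tfrac{1}{12}(3.6479448 \times 10^{-6}) + \ldots + 0.05 \cdot \tfrac{1}{12}(1.8692402 \times 10^{-6})} \\
&= \frac{2.8879563 \times 10^{-7}}{4.4399294 \times 10^{-7}} \\
&= 0.65045095
\end{aligned} \tag{XXXII}$$

$$P'(2+ \mid S) = \frac{0.95 \cdot P(2+) \cdot P(S \mid 2+)}{4.4399294 \times 10^{-7}} = 0.010187213 \tag{XXXIII}$$

$$P'(3+ \mid S) = \frac{0.95 \cdot P(3+) \cdot P(S \mid 3+)}{4.4399294 \times 10^{-7}} = 0.2652289 \tag{XXXIV}$$

$$P'(1- \mid S) = \frac{0.05 \cdot P(1-) \cdot P(S \mid 1-)}{4.4399294 \times 10^{-7}} = 0.05380379 \tag{XXXV}$$

$$P'(2- \mid S) = \frac{0.05 \cdot P(2-) \cdot P(S \mid 2-)}{4.4399294 \times 10^{-7}} = 0.0023707938 \tag{XXXVI}$$

$$P'(3- \mid S) = \frac{\phi(3-) \cdot P(3-) \cdot P(S \mid 3-)}{4.4399294 \times 10^{-7}} = 0.00042392676 \tag{XXXVII}$$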



$$P'(N \mid S) = \frac{0.05 \cdot P(N) \cdot P(S \mid N)}{4.4399294 \times 10^{-7}} = 0.017534085 \tag{XXXVIII}$$



Given the bias function φ(σ), the resulting coding potential calculation indicates a 65%
probability that the sequence is coding in the first reading frame on the forward strand. The
result represents the coding probability given the assumptions of the investigator stated as the
bias function.
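The biased calculation can be sketched the same way. The Python below applies the 0.95/0.05 bias of equation XXX to the likelihoods of equations XV through XXI, again assuming equal priors (an inference from the worked numerators, not a statement in the text). The names `LIKELIHOOD`, `PLUS_CODING`, and `biased_posterior` are illustrative, and intermediate digits may differ from the printed equations.

```python
# Likelihoods P(S | state) for GATGACATT, as listed in equations XV to XXI.
LIKELIHOOD = {
    "1+": 3.6479448e-6,
    "2+": 5.7133273e-8,
    "3+": 1.4874917e-6,
    "1-": 5.7332419e-6,
    "2-": 2.5262776e-7,
    "3-": 2.2607130e-7,
    "N": 1.8692402e-6,
}
PLUS_CODING = ("1+", "2+", "3+")


def bias(state):
    """Bias of equation XXX: favour the three forward coding states."""
    return 0.95 if state in PLUS_CODING else 0.05


def biased_posterior(likelihood):
    """Biased Bayes' rule with equal priors, which cancel in the ratio."""
    weights = {state: bias(state) * value for state, value in likelihood.items()}
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


biased = biased_posterior(LIKELIHOOD)
best = max(biased, key=biased.get)
for state, p in biased.items():
    print(f"P'({state}|S) = {p:.4f}")
print("most probable state:", best)
```

Under these assumptions the most probable state shifts from 1- to 1+, at roughly 65%, which matches the conclusion stated above.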

EXAMPLE 3 

The following is a copy of the output of a program implementing the method
described above with and without a bias function. The following sequence is a genomic sample
from the organism Arabidopsis thaliana, Landsberg.



TACTCAAAAATATATTCCATGCTTAATTAGGCCGGATTCGCGGTGACGATGCACCAAGAGCGGTTTTTCCGA 
GCATTGTAGGCCGTCCTCGCCACACCGGTGTGATGGTTGGGATGGGACAAAAGGATGCTTATGTTGGAGACGAGGCTC 
AATCAAAACGTGGTATCTTGACTCTGAAGTACCCAATTGAGCATGGAATTGTTAATAATTGGGATGACATGGAGAAGA 
TTTGGCATCACACTTTCTACAATGAGCTTCGTGTTGCCCCTGAAGAACATCCGGTTCTCTTGACCGAAGCTCCTCTCA 
ATCCGAAAGCTAACCGTGAGAAGATGACTCAGATCATGTTTGAGACATTCAATACTCCTGCTATGTATGTTGCCATTC 
AAGCTGTTCTCTCACTCTATGCCAGTGGCCGTACTACTGGTCAGTACATTACTACATTCTTTTTATACCGTTTGGTTG 
AAATAAAATTCGGTTTGGTTCGATTCGAGTTTGCTCTCATTATTTTTATTTTGTTGGTTAGGTATTGTTTTGGACTCC 
GGAGATGGTGTGAGCCACACGGTACCAATCTACGAGGGTTATGCACTTCCACACGCAATCCTGCGTCTTGATCTTGCA 
GGTCGTGACCTAACCGACCACCTTATGAAAATCCTGACAGAGCGTGGTTACTCTTTCACCACAACTGCTGAGCGTGAG 
ATTGTTAGAGACATGAAGGAGAAGCTCTCTTACATTGCCTTGGACTTTGAACAAGAGCTCGAGACTTCCAAAACAAGC
TCATCCGTTGAGAAGAGCTTCGAGCTGCCAGACGGTCAAGTGATCACCATCGGGGCAGAGCGTTTCCGATGCCCTGAA 
GTTCTGTTTCAGCCATCGATGATCGGAATGGATWITCCGGGAATTCATGAAACTACTTACAACTCAATCATGAAATGT 
GATGTGGATATCAGGAAGGATCTTTATGGAAACATTGTGCTTAGTGGTGGCACCACAATGTTCGATGGGATTGGTGAT 
AGGATGAGTAAAGAGATCACAGCGTTGGCTCCAAGCAGTATGAAGATCAAAGTGGTGGCTCCACCGGAAAGGAAGTAC
AGTGTCTGGATCGGTGGCTCTATCTTGGCTTCCCTCAGTACTTTCCAGCAGGTAAATTACTTACTATACTTT^TACAT 
A7VAGTCTATTAGTGATTTGATGTATAAAGTGTTACAAAAATGTGTTCCAAATTTGCAGATGTGGATTGCGAAAGCGGA 
GTATGATGAATCTGGACCGTCAATCGTCCACAGGAAGTGCTTCTGATCAAAAGTCACCAAGTAAAACAAGAGCGGTAA 
AAATTTTGATATCAGTTTTTCACCCTGAAGCCAGTTGCTATAATTACTCACAACTTCTCTATTTGTGTTCTTTTATTC 
TTGTCCCTCGTTGTTCATTTTAATCTCTTTTTTGCAACAAAGCAACTTAAAAAAACAGAGCAGTCATTAACAGAATGT
TATTATTATATATATGTATACATATTAGTATACACCCATTATTTCATTAAAACATTTATCATATAAGGATAGGATTCT 
ATACATCGATATATTTATTTTGTTGACACTATTCAGCACATGCTTATGTCTTATCTTGTTAGTATATGTAACCAAAGA 
CAAATAATAGATGCTACAAATTGTTTTCTTTGAAGCAAAAATTTCAATCTTAAAATTGTTTTTTTCCAGGTTACACAA 
AAAAAACTTGTAGTTTGTAAATTTTCTATACAATTTTGGGGATCTCAACAAGAACATGAACTTCAACTTCTAGTCATA 
TGACGACCTGAGTCTGCGCGGCTGTGTYATCTCTTTGCTGCAGTAAATGTTTACAAGTGGTGTGTAAATTGGTACTGAT
TCAAAAGCTTTAAGAAATCTACACATTTCGTGAAATTATTTAGCAGACTTGATATTAAAAATCTAGGATAAAATGACT 
ATCCAAAGACAAATAGGACTGTTTCACATGTTCCCCTGATTCTTGTAGCTCATAACTCATCAGCAGTTAACTTTTCTA
CCTCATACACGCTCGCAATNCGTTTGGAATTATCAGCTNTAATTTTTCTAATTCTTTGGAAATTATTAGCAGCTCGAT
CAAATGGGGCATGGCTTCTTCTTCTATCTGCAACTCATCTAAACTTTCCATGAAGAAACAAAGCT (SEQ. ID. NO. 1)

The sequence below is the same Arabidopsis sequence after coding probabilities have
been determined without a bias, the coding strand has been determined, and each nucleotide has
been classified in its most probable state of the four on the coding strand (dashes represent the
state of noncoding).



1 



1 






61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 



111111111111313333333333333333133333333333333333333333333333 
333323333333333333333333333313333333333333333333333333333333 
333333333333333333333333333333333313333313333333333333133333 
333333333133133333333133333333333333333333333333333333333133 
333333333333333133333333333333333313333333333333333333333333 
333-33333-333333-3333333333333333-33333333333333333333333333 
333333333333—3—3—333333333-33 






11 11-1 






-1111111111111111111111111111111111111111111111111111111-111 
1111111111111111111111111111111111111111111111111111111-1111 
1111111111-111-11111111111111-111111111111111111111111111111 
1111111111111-11111-11111111-1111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111111111111111111111111111111111111111111111-1111 
11111111111111111-1111111111111111111111111111-1111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111131111111111111131 



222-2222222222-22-222-222222-3333333333333333333333333 

3333333333333333 — 33-3 — 3 — 3 33-33333333-333 

- — 333 — 3 






1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



3 33-3 333-3 3 

3-3133-33-33-3 13-22222-222222-2222222222222-2 2 

__22 2222-1222222222222222221222222222222222222222222 

22222 

The classifications are now filtered. First, simple gaps are filled (XYX are reclassified as
XXX):






1 
61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 



1 

111111111111313333333333333333133333333333333333333333333333 
333323333333333333333333333313333333333333333333333333333333 
333333333333333333333333333333333313333313333333333333133333 
333333333133133333333133333333333333333333333333333333333133 
333333333333333133333333333333333313333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333 — 3—3—333333333333 

11 1111- 

-11111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111131111111111111131 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333— 3333--3— 3 333333333333333 

333 — 3 



3 3333 33333 3 

33313333333333 13-2222222222222222222222222222 2 

__22 2222-1222222222222222221222222222222222222222222 
2161: 22222

Next, XXYXX gaps are reclassified as XXXXX: 

1 

111111111111313333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333 333333333333 

11 1111- 

-11111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111111111111111111131 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333—3333 333333333333333 

333 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

__22 2222-1222222222222222222222222222222222222222222 

22222 



Next, XXYYXX gaps are reclassified as XXXXXX: 



111111111111313333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333 333333333333 

11111
541
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



llllllllllllllllllllllllllllllllllllllllllllimiimimm 
llllllllllllllllllllllllllllllllllllllllllllllllllllll^lim 
llllllllllllllllllllllllllllllllllllllllllllimillimi^m 
llllllllllllllllllllllllllllllllllllllllllllllllimm^mi 
lllllllllllllllllllllllllllllllllllllllllllllimillimmi 
lllllllllllllllllllllllllllllllllllllllllllllllllimimill 
llllllllllllllllllllllllllllllllllllllllllimiimmmm 
llllllllllllllllllllllllllllllllllllllllllllimimmmi^ 
lllllllllllllllllllllllllllllllllllllllllllllimiimmm 
1111111111111111111111111111131 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333333333 333333333333333 

— 333 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 



Next, XYYX gaps are reclassified as XXXX: 



1 

61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 



111111111111313333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333 333333333333 

11111 

llllllllllllllllllllllllllllllllllllllllllllllimmimm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
llllllllllllllllllllllllllllllllllllllllllllllllimimim 
llllllllllllllllllllllllllllllllllllllllllllllllimmim^ 
lllllllllllllllllllllllllllllllllllllllllllllimmmm^l 
llllllllllllllllllllllllllllllllllllllllllllllimimmm 
llllllllllllllllllllllllllllllllllllllllllllllllimiimm 
lllllllllllllllllllllllllllllllllllllllllllimillimmm 
lllllllllllllllllllllllllllllllllllllllllllllllllimimm 
1111111111111111111111111111131 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333333333 333333333333333 



1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



-333- 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 



Next, XYX gaps are reclassified as XXX: 



1 

61 
121 
181
241
301
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 



1 

111111111111113333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333 333333333333 

11111 

lllllllllllllllllllllllllllllllllllllllllllllllimilimill 
llllllllllllllllllllllllllllllllllllllllllllllllllllimim 
llllllllllllllllllllllllllllllllllllllllllllllllllimilllll 
111111111111111111111111111111111111111111111111111111111111 
lllllllllllllllllllllllllllllllllllllllllllllllllllllllimi 
lllllllllllllllllllllllllllllllllllllllllllllllllllllllimi 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111111111111111111111 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333333333 333333333333333 

— 333 



3333 33333 

33333333333333 13-2222222222222222222222222222- 
2101:

2161: 22222 



-222222222222222222222222222222222222222222222222 
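The pattern-based gap filters applied above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the string is scanned left to right with earlier fills visible to later positions, and any symbol (including the noncoding dash) may act as the flanking class X. The specification does not state how these details are handled.

```python
def fill_gaps(states, flank, gap):
    """Reclassify runs of `gap` symbols flanked on both sides by `flank`
    copies of the same symbol X, for example XYX -> XXX (flank=1, gap=1)."""
    chars = list(states)
    for start in range(flank, len(chars) - gap - flank + 1):
        x = chars[start - 1]
        left = chars[start - flank:start]
        middle = chars[start:start + gap]
        right = chars[start + gap:start + gap + flank]
        flanks_agree = all(c == x for c in left + right)
        if flanks_agree and any(c != x for c in middle):
            chars[start:start + gap] = [x] * gap
    return "".join(chars)


# Order used in this example: XYX, XXYXX, XXYYXX, XYYX, then XYX again.
FILTER_ORDER = [(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)]


def filter_states(states):
    """Apply the gap filters in the order listed above."""
    for flank, gap in FILTER_ORDER:
        states = fill_gaps(states, flank, gap)
    return states


print(filter_states("111-1113111--111"))
```

Each pass only changes positions whose neighbours on both sides already agree, so isolated misclassifications are absorbed into the surrounding class while longer disagreements are left for the later steps.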



Next, regions between coding regions that are not introns are reclassified according to the 
adjacent sequences: 



1: 
61: 
121: 
181: 
241: 
301: 
361: 
421: 
481: 
541: 
601: 
661: 
721: 
781: 
841: 
901: 
961: 
1021: 
1081: 
1141: 
1201: 
1261: 
1321: 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 
2041:
2101: 
2161: 



111111111111113333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333 

11111

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiiimm^ii^ii^^^^-^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimimm^ii^ii^^i^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimmmm^ii^i^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimimm^i^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimimiimm^i^^^^^^^^^-^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiiimm^m-^-^^^^^^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimm^imi^^^^i^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimmm^i^^^i^^^^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii^imm-^^i^^^^^^^^ 
1111111111111111111111111111111 

222222222222222222222222222233333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333 



3333333333333333333333333333333333333333333333 

333333333333333311132222222222222222222222222222222222222222 
222222222222222222222222222222222222222222222222222222222222 

22222 



Next, the sequence is checked for frameshifts and reclassified accordingly: 



61: 111111111111111111111111111111111111111111111111111111111111

121: 111111111111111111111111111111111111111111111111111111111111

241: 111111111111111111111111111111111111111111133333333333333333

301: 333333333333333333333333333333333333333333333333333333333333

361: 333333333333333333333333333333333333333333333333333333333333
421
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



333333333333333333333333333333333 

11111 

lllllllllllllllllllllllllllllllllllllllllllllllllllllimm 
lllllllllllllllllllllllllllllllllllllllllllllimilllimm 
lllllllllllllllllllllllllllllllllllllllllllllllimillllllll 
llllllllllllllllllllllllllllllllllllllllllllllllllllllllim 
lllllllllllllllllllllllllllllllllllllllllllllllllllllimm 
lllllllllllllllllllllllllllllllllllllllllllllllllllllimm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
lllllllllllllllllllllllllllllllllllllllllllllllllilimilill 
lllllllllllllllllllllllllllllllllllllllllllllllimiimmi 
1111111111111111111111111111111 

222222222222222222222222222222222222222233333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333 



3333333333333333333333333333333333333333333333 

333333333333333333333333333333333222222222222222222222222222 
222222222222222222222222222222222222222222222222222222222222 

22222 



Finally, the sequence is translated according to each class in each coding region, where
"x" indicates a stop codon:

1 : XRFFRALxAVLATPVxWLGWDKRMLMLETRLNQNVVSxLxSTQLSMELLIIGMTWRRFGI 

61 : TLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLSLYASGRT 

121 : TGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILTERGYSFT 

181 : TTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAERFRCPEVL 

241 : FQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMSKEITALA 

301 : PSSMKIKVVAPPERKYSVWIGGSIXVPNLQMWIAKAEYXNLDRQSSTGSASDQKSPSKTR 

361 : AVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASSSICNSSKLSMKK 

421 : QSX (SEQ. ID. NO. 2) 
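The translation step can be illustrated with a short Python sketch based on the standard genetic code. It assumes the input has already been trimmed to the reading frame implied by its class; the function name `translate`, and the use of an uppercase "X" for codons containing ambiguous bases, are assumptions made for this example (the text above specifies only that "x" marks a stop codon).

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_sequence, stop="x", unknown="X"):
    """Translate an in-frame DNA string codon by codon.

    Stop codons are written as `stop`; codons containing anything other than
    A, C, G or T are written as `unknown`. A trailing partial codon is ignored.
    """
    sequence = coding_sequence.upper()
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        residue = CODON_TABLE.get(sequence[start:start + 3], unknown)
        protein.append(stop if residue == "*" else residue)
    return "".join(protein)


print(translate("ATGGAAAATCCGGGAATTCATGAA"))
print(translate("ATGTAAGGN"))
```

In a full implementation the frame offset for each coding region would come from its class (1, 2, or 3), and reverse-strand regions would be reverse-complemented first; both steps are omitted here for brevity.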

The following sequence is the same Arabidopsis sequence used above, but with an
applied bias. Two bias functions are given by equations XXXIX and XL:

$$\phi_1(\sigma) = \begin{cases} 0.95 & \text{if } \sigma \in \{1+, 2+, 3+, 1-, 2-, 3-\} \\ 0.05 & \text{if } \sigma = N \end{cases} \tag{XXXIX}$$

$$\phi_2(\sigma) = \begin{cases} 0.05 & \text{if } \sigma \in \{1+, 2+, 3+, 1-, 2-, 3-\} \\ 0.95 & \text{if } \sigma = N \end{cases} \tag{XL}$$

where φ1 is applied to a range of the DNA to which an EST has been associated, while φ2
is applied to a range of the DNA to which a gap (or intron) in the EST has been associated.
Specifically, φ1 is applied to nucleotides 1093 through 1137 and 1219 through 1291, while φ2 is
applied to nucleotides 1138 through 1218. The probabilities are calculated with the bias, the
coding strand is determined, and each nucleotide is classified as the most likely state. The
resulting sequence is depicted below.






1 
61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 



1 

111111111111313333333333333333133333333333333333333333333333 
333323333333333333333333333313333333333333333333333333333333 
333333333333333333333333333333333313333313333333333333133333 
333333333133133333333133333333333333333333333333333333333133 
333333333333333133333333333333333313333333333333333333333333 
333-33333-333333-3333333333333333-33333333333333333333333333 

333333333333 — 3 — 3 — 333333333-33 

11 — 11-1- 

-1111111111111111111111111111111111111111111111111111111-111
1111111111111111111111111111111111111111111111111111111-1111 
1111111111-111-11111111111111-111111111111111111111111111111 
1111111111111-11111-11111111-1111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111111111111111111111111111111111111111111111-1111 
11111111111111111-1111111111111111111111111111-1111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
11111111111111311111111111111311111111-1 

221221222122222213333333333333333333333333 

3333333333333333333333333333333-33-33333333-333 

—333 — 3 
1741
1801 
1861 
1921 
1981 
2041 
2101 
2161 



3—33-3 333-3 3 

3-3133-33-33-3 13-22222-222222-2222222222222-2 2 

_„22 2222-1222222222222222221222222222222222222222222 

22222 
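The range-based choice between the two bias functions of equations XXXIX and XL can be sketched in Python as below. The nucleotide ranges are those given in this example (1-based, inclusive). Treating positions outside every range as unbiased (weight 1.0) is an assumption of this sketch, as are the names `EST_RANGES`, `GAP_RANGES`, and `bias_at`.

```python
CODING_STATES = {"1+", "2+", "3+", "1-", "2-", "3-"}

# Ranges from the example, 1-based and inclusive.
EST_RANGES = [(1093, 1137), (1219, 1291)]  # EST aligned: favour coding states
GAP_RANGES = [(1138, 1218)]                # gap or intron in the EST: favour N


def in_ranges(position, ranges):
    """True if the 1-based position falls inside any inclusive range."""
    return any(first <= position <= last for first, last in ranges)


def bias_at(position, state):
    """Bias weight for a state at a nucleotide position."""
    coding = state in CODING_STATES
    if in_ranges(position, EST_RANGES):
        return 0.95 if coding else 0.05
    if in_ranges(position, GAP_RANGES):
        return 0.05 if coding else 0.95
    return 1.0  # assumed neutral weight where no EST information exists


for position in (500, 1100, 1150, 1250):
    print(position, bias_at(position, "1+"), bias_at(position, "N"))
```

A weight looked up this way would multiply the prior and likelihood of the corresponding state before normalization, in the same manner as the bias term in equation XXXI.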






Filtering steps are then applied as before: XYX to XXX: 






1 
61 

121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



111111111111313333333333333333133333333333333333333333333333 
333323333333333333333333333313333333333333333333333333333333 
333333333333333333333333333333333313333313333333333333133333 
333333333133133333333133333333333333333333333333333333333133 
333333333333333133333333333333333313333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333— 3--3 333333333333 

11 1111- 

-11111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111131111111111111131111111111 

221221222122222213333333333333333333333333 

33333333333333333333333333333333333333333333333 

— 333 — 3 



3 3333 33333 3 

33313333333333 13-2222222222222222222222222222 2 

__22 2222-1222222222222222221222222222222222222222222 

22222 



XXYXX to XXXXX:



1: 
61: 



111111111111313333333333333333333333333333333333333333333333
333333333333333333333333333333333333333333333333333333333333
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333 333333333333 

11—1111- 

-lllllllllllllllllllllllllllllllllllllllllllllllllimmm 
lllllllllllllllllllllllllllllllllllllllllllllimimimm 
lllllllllllllllllllllllllllllllllllllllllllllimiimmm 
lllllllllllllllllllllllllllllllllllllllllllllllimmmm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
llllllllllllllllllllllllllllllllllllllllllllllllllimmm 
1111111111111111111111111111111111111111 

222222222222222213333333333333333333333333 

33333333333333333333333333333333333333333333333 

—333 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

—22 2222-1222222222222222222222222222222222222222222 

22222 

XXYYXX to XXXXXX: 

1, 1 

61: 111111111111313333333333333333333333333333333333333333333333 

121: 333333333333333333333333333333333333333333333333333333333333 

181: 333333333333333333333333333333333333333333333333333333333333 

241: 333333333333333333333333333333333333333333333333333333333333

301: 333333333333333333333333333333333333333333333333333333333333

361: 333333333333333333333333333333333333333333333333333333333333 

421: 333333333333 333333333333 

481: 11111

541: 111111111111111111111111111111111111111111111111111111111111 
601: 111111111111111111111111111111111111111111111111111111111111 

721: 111111111111111111111111111111111111111111111111111111111111
781: 111111111111111111111111111111111111111111111111111111111111
841: 111111111111111111111111111111111111111111111111111111111111
901
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111 

222222222222222213333333333333333333333333 

33333333333333333333333333333333333333333333333 

— 333 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 



XYYX to XXXX: 



1 

61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 



111111111111313333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333 333333333333 

mil 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimm 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiim 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiim^ii 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiim 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimi^mim 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimim-^ 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimm 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimimm 
111111111111111111111111111111111111111111111111111111111111 
1111111111111111111111111111111111111111 

222222222222222213333333333333333333333333 

33333333333333333333333333333333333333333333333 

—333 






1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 



XYX to XXX:




1 
61 
121 
181 
241 
301 
361 
421 
481 
541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



111111111111113333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 



-333333333333- 



333333333333 

mil 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimii-^ii 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiimim-^m 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiiiiiiiimi 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiiiiiii 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimmiimim 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimmm 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimi 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiim 
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimmimimm 
1111111111111111111111111111111111111111 

222222222222222213333333333333333333333333 

33333333333333333333333333333333333333333333333 

— 333 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 
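
The filtering passes shown above can be sketched in code. The following Python sketch is illustrative only and is not the implementation that produced the listings. The names `apply_rule` and `smooth`, the left-to-right in-place scan, and the restriction of the first XYX pass to noncoding (`-`) nucleotides are assumptions made for this example.

```python
NONCODING = "-"  # assumed symbol for a nucleotide classified as noncoding

# (flank length, run length, reclassify noncoding runs only), in the order
# the passes are listed in the text. Restricting the first pass to noncoding
# nucleotides is an assumption.
RULES = [
    (1, 1, True),   # XYX    -> XXX    (first pass)
    (2, 1, False),  # XXYXX  -> XXXXX
    (2, 2, False),  # XXYYXX -> XXXXXX
    (1, 2, False),  # XYYX   -> XXXX
    (1, 1, False),  # XYX    -> XXX    (final pass)
]


def apply_rule(labels, flank, run, noncoding_only):
    """Reclassify a run of `run` identical labels Y to X when it is flanked
    on both sides by `flank` labels that are all X (with X != Y)."""
    out = list(labels)
    i = flank
    while i + run + flank <= len(out):
        x = out[i - flank]
        left = out[i - flank:i]
        mid = out[i:i + run]
        right = out[i + run:i + run + flank]
        flanks_uniform = all(c == x for c in left + right)
        mid_uniform = all(c == mid[0] for c in mid) and mid[0] != x
        allowed = (not noncoding_only) or mid[0] == NONCODING
        if flanks_uniform and mid_uniform and allowed:
            out[i:i + run] = [x] * run
        i += 1
    return "".join(out)


def smooth(labels):
    """Apply every pass once, in the listed order."""
    for flank, run, noncoding_only in RULES:
        labels = apply_rule(labels, flank, run, noncoding_only)
    return labels
```

Because the scan works in place from left to right, a reclassified nucleotide can serve as a flank for the next position, so the result depends on scan direction.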

Gaps between coding regions that are not introns are filled as before: 



61 



111111111111113333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333333333333333333333333 

11111 

111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
111111111111111111111111111111111111111111111111111111111111 
1111111111111111111111111111111111111111 

222222222222222213333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333 



3333333333333333333333333333333333333333333333 

333333333333333311132222222222222222222222222222222222222222 
222222222222222222222222222222222222222222222222222222222222 

22222 

Frameshifts are verified and nucleotides are reclassified accordingly: 



61: 111111111111111111111111111111111111111111111111111111111111 

121: 111111111111111111111111111111111111111111111111111111111111 

181: 111111111111111111111111111111111111111111111111111111111111 

241: 111111111111111111111111111111111111111111133333333333333333 

301: 333333333333333333333333333333333333333333333333333333333333 

361: 333333333333333333333333333333333333333333333333333333333333 

421: 333333333333333333333333333333333 

481: 11111 

541: 111111111111111111111111111111111111111111111111111111111111 

601: 111111111111111111111111111111111111111111111111111111111111 

661: 111111111111111111111111111111111111111111111111111111111111 

721: 111111111111111111111111111111111111111111111111111111111111 

781: 111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111 

222222222222222222222222222233333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333 



3333333333333333333333333333333333333333333333 

333333333333333333333333333333333222222222222222222222222222 
222222222222222222222222222222222222222222222222222222222222 

22222 



And the sequence is translated as before: 

1 : XRFFRALxAVLATPVxWLGWDKRMLMLETRLNQNVVSxLxSTQLSMELLIIGMTWRRFGI 

61 : TLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLSLYASGRT 

121 : TGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILTERGYSFT 

181 : TTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAERFRCPEVL 

241 : FQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMSKEITALA 

301 : PSSMKIKVVAPPERKYSVWIGGSILASXQMWIAKAEYXNLDRQSSTGSASDQKSPSKTRA 

361 : VKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASSSICNSSKLSMKKQ 

421 : SX (SEQ. ID. NO. 3) 
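
The translation step can be illustrated with a short Python sketch. It assumes the standard genetic code, joins the exon strings before reading codons, writes `*` for a stop codon, and writes `X` for any codon containing a base other than A, C, G or T. The function name `translate` and these output conventions are assumptions for illustration and need not match the symbols used in the listing above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(exons):
    """Concatenate exon sequences and translate them codon by codon.
    A trailing partial codon is ignored; unknown codons become 'X'."""
    coding = "".join(exons).upper()
    usable = len(coding) - len(coding) % 3
    return "".join(
        CODON_TABLE.get(coding[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

For example, `translate(["ATGGC", "CTAA"])` joins the two exons into `ATGGCCTAA` and yields `MA*`.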



The resulting amino acid sequence (SEQ. ID. NO. 3) differs from the amino acid 
sequence calculated without a bias (SEQ. ID. NO. 2). The relative accuracy of the two amino 
acid sequences can be determined by comparison to a known sequence. SEQ. ID. NO. 2 and 
SEQ. ID. NO. 3 are compared to the translation of the actin gene from Arabidopsis thaliana, 
Columbia (SEQ. ID. NO. 4). Dashes indicate gaps in the sequence and asterisks indicate a match 
among all three sequences. The predicted amino acid sequences (SEQ. ID. NOs. 2 and 3) are 
based on an Arabidopsis thaliana, Landsberg ecotype. A comparison of the predicted sequences with a 
known Arabidopsis thaliana, Columbia ecotype amino acid sequence (SEQ. ID. NO. 4) is shown 
below. The sequence set forth in Box A illustrates an area of the biased sequence that shows a 
higher level of identity with the Arabidopsis thaliana, Columbia sequence. 

unbiased -XRFFRALX-AVLATPVXWLGWDKRMLMLETRLNQNVVSX--LXSTQLSMELLIIG---M
biased   -XRFFRALX-AVLATPVXWLGWDKRMLMLETRLNQNVVSX--LXSTQLSMELLIIG---M
Columbia GDDAPRAVFPSIVGRPR-HTGVMVGMGQKDAYVGDEAQSKRGILTLKYPIEHGIVNNWDD
              **        *    *    *            *           *  *

unbiased TWRRFGITLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS
biased   TWRRFGITLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS
Columbia MEKIWHHTFYNELRVAPEEHPVLLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS
                *      * *      *************************************

unbiased LYASGRTTGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT
biased   LYASGRTTGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT
Columbia L-ASGRTTGG------IVLDSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT
         * *******           ****************************************

unbiased ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER
biased   ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER
Columbia ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER
         ************************************************************

unbiased FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS
biased   FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS
Columbia FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFGGIGDRMS
         **************************************************** *******

Box A

unbiased KEITALAPSSMKIKVVAPPERKYSVWIGGSIX-----VPNLQMWIAKAEYXNLDRQSSTG
biased   KEITALAPSSMKIKVVAPPERKYSVWIGGSILAS------XQMWIAKAEYXNLDRQSSTG
Columbia KEITALAPSSMKIKVVAPPERKYSVWIGGSILASLSTFQQMQMWIAKAEY------DESG
         *******************************          *********         *

unbiased SASDQKSPSKTRAVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASS
biased   SASDQKSPSKTRAVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASS
Columbia PS IVHRKCF

unbiased SICNSSKLSMKKQSX
biased   SICNSSKLSMKKQSX
Columbia
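
The asterisk rows in the comparison follow a simple rule: a column is marked when all three aligned residues are identical. A minimal Python sketch of that rule follows. The function name `match_line` and the choice to leave gap columns unmarked are assumptions made for illustration.

```python
def match_line(*rows, gap="-"):
    """Return a string with '*' under every column in which all aligned
    rows carry the same residue, and a space elsewhere.
    Columns consisting only of the gap symbol are left unmarked."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != gap else " "
        for col in zip(*rows)
    )


# One 60-residue block of the comparison, in which only a single column differs.
unbiased = "FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS"
biased = unbiased
columbia = "FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFGGIGDRMS"
marks = match_line(unbiased, biased, columbia)
```

Here `marks` contains an asterisk in every column except the one where the predicted sequences carry D and the Columbia sequence carries G.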
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We Claim 



1. A method for determining a probability for one or more states for a nucleotide in a nucleic 
acid sequence, comprising: 

a) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in said nucleic acid sequence; 

b) determining transition probabilities for each of said states for nucleotides within said 
nucleic acid sequence following said initial oligonucleotide; 

c) determining a probability for said nucleic acid sequence for each of said states; and, 

d) determining a probability for each of said states for said nucleotide based upon said 
probability of said nucleic acid sequence and a bias. 

2. The method of claim 1, wherein said probability for each of said states for said nucleotide is 
determined using an inhomogeneous Markov model having eight states, wherein said eight states 
are: first reading frame positive strand (1+); second reading frame positive strand (2+); third 
reading frame positive strand (3+); first reading frame negative strand (1-); second reading frame 
negative strand (2-); third reading frame negative strand (3-); noncoding positive strand (N+); 
and, noncoding negative strand (N-). 

3. The method of claim 2, wherein said probability for each of said eight states for said 
nucleotide in step e) is determined using the equation 

4. The method of claim 1, wherein said nucleotide is the middle nucleotide in said nucleic acid 
sequence. 


P(f|S) = 

i ∈ {1+, 2+, 3+, N+, 1-, 2-, 3-, N-}
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5. The method of claim 1, wherein said nucleic acid sequence is part of a longer nucleic acid 
sequence. 

6. The method of claim 1, wherein said bias is between 0.0 and 0.9 or greater than 1.1. 
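
The steps of claim 1 can be illustrated with a short Python sketch. The sketch is hypothetical and does not reproduce the equation referred to in claim 3. It assumes a Bayes-rule form in which the probability of the sequence under each state is weighted by a prior and by a multiplicative bias (a bias of 1.0 leaving a state unbiased) and then normalized over the eight states. The names `sequence_probability` and `biased_posteriors`, the table layouts, and the uniform default prior are all assumptions.

```python
STATES = ("1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-")


def sequence_probability(seq, initial, transitions, order=2):
    """Probability of `seq` under one state: the probability of the initial
    oligonucleotide times one transition probability per later nucleotide.
    `transitions` is a list of tables used cyclically by position, which
    allows position-dependent (inhomogeneous) transition probabilities.
    Each table maps (preceding oligonucleotide, next base) to a probability."""
    p = initial[seq[:order]]
    for i in range(order, len(seq)):
        table = transitions[i % len(transitions)]
        p *= table[(seq[i - order:i], seq[i])]
    return p


def biased_posteriors(likelihoods, priors=None, bias=None):
    """Turn per-state sequence probabilities into per-state probabilities
    for the nucleotide, weighting each state by an assumed prior and bias."""
    priors = priors or {s: 1.0 / len(STATES) for s in STATES}
    bias = bias or {s: 1.0 for s in STATES}
    weighted = {s: bias[s] * priors[s] * likelihoods[s] for s in STATES}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("all weighted state probabilities are zero")
    return {s: w / total for s, w in weighted.items()}
```

For long windows the product in `sequence_probability` can underflow, so a practical version would likely accumulate logarithms instead.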

7. A method for determining a probability for one or more states for a nucleotide in a nucleic 
acid sequence, comprising: 

a) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in said nucleic acid sequence; 

b) determining transition probabilities for each of said states for nucleotides within said 
nucleic acid sequence following said initial oligonucleotide; 

c) determining a probability for said nucleic acid sequence for each of said states; and, 

d) determining a probability for each of said states for said nucleotide based upon said 
probability of said nucleic acid sequence, wherein said determining a probability for each of said 
states is capable of accepting a bias. 

8. A method for determining a probability for each of one or more states for more than one 
nucleotide in a nucleic acid sequence comprising: 

a) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in a window of a first nucleotide; 

b) determining transition probabilities for each of said states for nucleotides within said 
window following said initial oligonucleotide; 

c) determining a probability for said window for each of said states; 

d) determining a probability for each of said states for said nucleotide based upon said 
probability for said window and a bias; and, 

e) repeating steps a) through d) for each remaining nucleotide in said nucleic acid 
sequence. 

9. The method of claim 8, wherein said more than one nucleotide are contiguous, and step e) is 
performed sequentially from said first nucleotide to a last nucleotide. 

10. The method of claim 9, wherein said probability for each of said states for said more than 
one nucleotide is determined using an inhomogeneous Markov model having eight states, 
wherein said eight states are: first reading frame positive strand (1+); second reading frame 
positive strand (2+); third reading frame positive strand (3+); first reading frame negative strand 
(1-); second reading frame negative strand (2-); third reading frame negative strand (3-); 
noncoding positive strand (N+); and, noncoding negative strand (N-). 

11. The method of claim 10, wherein said probability for each of said states for said more than 
one nucleotide is determined using the equation 



12. The method of claim 8, wherein said nucleic acid sequence is part of a longer nucleic acid 
sequence. 

13. The method of claim 8, wherein each nucleotide in said more than one nucleotide is the 
middle nucleotide in its own window. 

14. The method of claim 8, further comprising: 

f) extending said nucleic acid sequence if said window extends beyond either end of said 
nucleic acid sequence, wherein said extending is accomplished by copying nucleotides from an 
end of said nucleic acid sequence at which said window is located to produce a copied nucleotide 
sequence, and adding said copied nucleotide sequence to said end. 
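
The windowing of claims 8, 13 and 14 can be sketched as follows. This Python sketch is illustrative only. It assumes an odd window length so that each nucleotide is the single middle nucleotide of its own window, and it assumes that the copied nucleotides are the terminal nucleotides of the sequence, appended unchanged. The names `extend_ends` and `centered_windows` and the default length of 99 are hypothetical.

```python
def extend_ends(seq, half):
    """Extend both ends of `seq` by copying its first and last `half`
    nucleotides, so that a window centered on a terminal nucleotide does
    not run past the sequence."""
    if half > len(seq):
        raise ValueError("sequence is shorter than half a window")
    if half == 0:
        return seq
    return seq[:half] + seq + seq[-half:]


def centered_windows(seq, window=99):
    """Yield one window per nucleotide of `seq`, with that nucleotide in
    the middle position of its window."""
    if window % 2 == 0:
        raise ValueError("use an odd window so there is one middle nucleotide")
    half = window // 2
    padded = extend_ends(seq, half)
    for i in range(len(seq)):
        yield padded[i:i + window]
```

Each yielded window can then be scored for every state and converted to per-state probabilities for its middle nucleotide.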

15. The method of claim 8, wherein said window has a length of about 75 to about 125. 

16. The method of claim 8, wherein said bias is between 0.0 and 0.9 or greater than 1.1. 



P(f|S) = 



4>if) ■ Pf PfjS) 
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17. A method for determining strand coding of a nucleic acid sequence, comprising: 

a) determining a probability of each of one or more states for each nucleotide in said 
nucleic acid sequence based upon a bias, wherein each of said states is either a positive strand 
state or a negative strand state; 

b) summing said probabilities of said positive strand states for each of said nucleotides to 
produce a sum of probabilities for positive states; 

c) summing said probabilities of said negative strand states for each of said nucleotides to 
produce a sum of probabilities for negative states; and, 

d) deciding one of 

i) coding is mixed or not detectable if a first function of said sum of probabilities 
for positive states and said sum of probabilities for negative states is less than a threshold 
value; 

ii) coding is on said positive strand if a second function of said sum of 
probabilities for positive states is greater than a third function of said sum of probabilities 
for negative states and said first function is not less than said threshold value; and 

iii) coding is on said negative strand if said second function of said sum of 
probabilities for positive states is not greater than said third function of said sum of 
probabilities for negative states and said first function is not less than said threshold 
value. 
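
The decision of claim 17 can be sketched in Python. The sketch is hypothetical. It uses the identity for the second and third functions, as in claim 20, and takes the first function as a parameter. The default first function, |X - Y| / (X + Y), is only a placeholder assumption and is not taken from the claims; the name `decide_strand` and the returned strings are likewise illustrative.

```python
def decide_strand(positive_probs, negative_probs, threshold=0.5,
                  first_function=None):
    """Decide the coding strand from per-nucleotide probabilities.
    `positive_probs` and `negative_probs` hold, for each nucleotide, the
    summed probability of the positive-strand and negative-strand states."""
    x = sum(positive_probs)
    y = sum(negative_probs)
    if first_function is None:
        # Placeholder assumption: relative separation of the two sums.
        def first_function(a, b):
            return abs(a - b) / (a + b) if (a + b) else 0.0
    if first_function(x, y) < threshold:
        return "mixed or not detectable"
    # Second and third functions taken as the identity: compare X with Y.
    return "positive" if x > y else "negative"
```

A caller with a different first function can pass it in without changing the rest of the decision.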

18. The method of claim 17, wherein said sum of probabilities for positive states is X, said sum 
of probabilities for negative states is Y and said first function is jiX, 7) = -j^ — ^ . 

19. The method of claim 18, wherein said threshold value is from about 0.4 to about 0.6. 

20. The method of claim 17, wherein said sum of probabilities for positive states is X, said sum 
of probabilities for negative states is Y, said second function is f(X)=X, and said third function is 
f(Y)=Y. 

21. The method of claim 17, wherein step a) comprises: 

e) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in a window of a first nucleotide; 

f) determining transition probabilities for each of said states for nucleotides within said 
window following said initial oligonucleotide; 

g) determining a probability for said window for each of said states; 

h) determining a probability for each of said states for said nucleotide based upon said 
probability for said window and a bias; and, 

i) repeating steps e) through h) for each remaining nucleotide in said nucleic acid 
sequence. 

22. A method for determining the extent of an open reading frame within a nucleic acid 
sequence, comprising: 

a) determining the probability of each of one or more states for each nucleotide in said 
nucleic acid sequence based upon a bias, wherein each of said states is either a coding state or a 
noncoding state; 

b) determining the coding strand of said nucleic acid sequence; and, 

c) determining the points within said nucleic acid sequence in said coding strand at which 
the sum of the probabilities of said coding states for each nucleotide drops below a first threshold 
value for a number of nucleotides greater than a second threshold value, wherein ends of said 
open reading frame are indicated at said points. 

23. The method of claim 22, wherein said first threshold value is about 0.4 to about 0.6. 

24. The method of claim 22, wherein said second threshold value is about 500 to about 700. 

25. The method of claim 22, wherein step c) comprises: 

d) determining the sum of said coding states for a middle nucleotide located in said 
nucleic acid sequence; 

e) repeating step d) sequentially for nucleotides located on a first side of said middle 
nucleotide until either 

i) the sum of the probabilities of said coding states drops below said first 
threshold value for a number of nucleotides greater than said second threshold value, or 

ii) an end of said nucleic acid sequence is reached, 

at which point an end of the open reading frame is indicated; and, 

f) repeating step e) for nucleotides located on a second side of said middle nucleotide. 
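
The outward search of claim 25 can be sketched in Python. The sketch is hypothetical. It assumes that `coding` holds, for each nucleotide on the coding strand, the sum of the probabilities of the coding states, and that the end of the open reading frame is reported at the last nucleotide at or above the first threshold before a sufficiently long low run. The names `walk_to_end` and `orf_extent` and the default thresholds (0.5 and 600, within the ranges of claims 23 and 24) are illustrative.

```python
def walk_to_end(coding, start, step, first_threshold, second_threshold):
    """Walk from `start` in direction `step` (+1 or -1). Return the index of
    the last nucleotide at or above `first_threshold` seen before a run of
    more than `second_threshold` low nucleotides, or the last index of the
    sequence in that direction if the sequence ends first."""
    low_run = 0
    last_high = start
    i = start
    while 0 <= i < len(coding):
        if coding[i] < first_threshold:
            low_run += 1
            if low_run > second_threshold:
                return last_high
        else:
            low_run = 0
            last_high = i
        i += step
    return i - step


def orf_extent(coding, first_threshold=0.5, second_threshold=600):
    """Return (left_end, right_end) indices found by searching outward
    from the middle nucleotide of the sequence."""
    if not coding:
        raise ValueError("empty sequence")
    middle = len(coding) // 2
    left = walk_to_end(coding, middle, -1, first_threshold, second_threshold)
    right = walk_to_end(coding, middle, +1, first_threshold, second_threshold)
    return left, right
```

With a short second threshold the search stops at the first brief dip, while a long one lets it pass over low-probability stretches such as introns.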

26. The method of claim 22, wherein said nucleic acid sequence is part of a longer nucleic acid 
sequence. 

27. The method of claim 22, wherein step b) comprises 

d) summing probabilities of positive strand states for each of said nucleotides to produce 
a sum of probabilities for positive states; 

e) summing probabilities of negative strand states for each of said nucleotides to produce 
a sum of probabilities for negative states; and, 

f) deciding one of 

i) coding is mixed or not detectable if a first function of said sum of probabilities 
for positive states and said sum of probabilities for negative states is less than a threshold 
value; 

ii) coding is on said positive strand if a second function of said sum of 
probabilities for positive states is greater than a third function of said sum of probabilities 
for negative states and said first function is not less than said threshold value; and 

iii) coding is on said negative strand if said second function of said sum of 
probabilities for positive states is not greater than said third function of said sum of 
probabilities for negative states and said first function is not less than said threshold 
value. 

28. The method of claim 22, wherein step a) comprises: 

d) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in a window of a first nucleotide; 

e) determining transition probabilities for each of said states for nucleotides within said 
window following said initial oligonucleotide; 

f) determining a probability for said window for each of said states; 

g) determining a probability for each of said states for said nucleotide based upon said 
probability for said window and a bias; and, 

h) repeating steps d) through g) for each remaining nucleotide in said nucleic acid 
sequence. 

29. A method for determining the location of insertions and deletions within a nucleic acid 
sequence, comprising: 

a) determining the probability of each of one or more states for each nucleotide in said 
nucleic acid sequence based upon a bias, wherein each of said states is either a coding state or a 
noncoding state; 

b) setting a length for a window; 

c) determining which state has a maximum mean probability for said nucleic acid 
sequence on a first side of a middle nucleotide in said window, wherein said window begins at a 
first nucleotide; 

d) determining which state has a maximum mean probability for said nucleic acid 
sequence on a second side of said middle nucleotide in said window; 

e) determining that a deletion or insertion occurred at said middle nucleotide if 

i) said state with said maximum mean probability on said first side of said middle 
nucleotide is different from said state with said maximum mean probability on said 
second side of middle nucleotide, and 

ii) either an average of hypothetical state probabilities for said window with an 
insertion at said middle nucleotide or an average of hypothetical state probabilities for 
said window with a deletion at said middle nucleotide is greater than a sum of said 
middle nucleotide's coding states probabilities; and, 

f) repeating steps c) through e) for each remaining nucleotide in said nucleic acid 
sequence after said first nucleotide, wherein said window begins at each remaining nucleotide in 
turn. 

30. The method of claim 29, further comprising: 

g) determining that a deletion occurred if said average of hypothetical state probabilities 
for said window with an insertion at said middle nucleotide is greater than said average of 
hypothetical state probabilities for said window with a deletion at said middle nucleotide, or that 
an insertion occurred if said average of hypothetical state probabilities for said window with an 
insertion at said middle nucleotide is not greater than said average of hypothetical state 
probabilities for said window with a deletion at said middle nucleotide. 

31. The method of claim 29, wherein said nucleic acid sequence is part of a longer nucleic acid 
sequence. 

32. The method of claim 29, wherein said repeating in step f) is performed sequentially from 
said first nucleotide to a last nucleotide. 

33. The method of claim 29, wherein said window is about 75 to about 125. 

34. The method of claim 29, wherein step a) comprises: 

g) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in a window of a first nucleotide; 

h) determining transition probabilities for each of said states for nucleotides within said 
window following said initial oligonucleotide; 

i) determining a probability for said window for each of said states; 

j) determining a probability for each of said states for said nucleotide based upon said 
probability for said window and a bias; and, 

k) repeating steps g) through j) for each remaining nucleotide in said nucleic acid 
sequence. 

35. A method for determining exon location within a nucleic acid sequence, comprising: 

a) determining the probability of each of one or more states for each nucleotide in said 
nucleic acid sequence based upon a bias, wherein each of said states is either a coding state or a 
noncoding state; 

b) determining the coding strand of said nucleic acid sequence; 

c) determining the extent of an open reading frame within said nucleic acid sequence; 

d) classifying each nucleotide in a coding class or a noncoding class based on a most 
probable state for said coding strand; 

e) reclassifying each nucleotide according to defined rules; and, 

f) determining that regions of said nucleic acid sequence in said coding class are exons. 

36. The method of claim 35, wherein step e) comprises: 

g) reclassifying a noncoding nucleotide to a class of an adjacent nucleotide on a first side 
of said noncoding nucleotide and an adjacent nucleotide on a second side of said noncoding 
nucleotide if said adjacent nucleotide on said first side and said adjacent nucleotide on said 
second side all are of a single class; 

h) reclassifying a nucleotide to a class of two adjacent nucleotides on a first side and two 
adjacent nucleotides on a second side if said two adjacent nucleotides on said first side and said 
two adjacent nucleotides on said second side all are of a single class; 

i) reclassifying a first pair of adjacent nucleotides having a same class to a class of two 
adjacent nucleotides on a first side of said first pair and two adjacent nucleotides on a second 
side of said first pair if said two adjacent nucleotides on said first side and said two adjacent 
nucleotides on said second side all are of a single class; 

j) reclassifying a second pair of adjacent nucleotides having a same class to a class of an 
adjacent nucleotide on a first side of said second pair and an adjacent nucleotide on a second side 
of said second pair if said adjacent nucleotide on said first side and said adjacent nucleotide on 
said second side both are of a single class; 

k) reclassifying a nucleotide to a class of an adjacent nucleotide on a first side of said 
nucleotide and an adjacent nucleotide on a second side of said nucleotide if said adjacent 
nucleotide on said first side and said adjacent nucleotide on said second side both are of a single 
class; 

l) reclassifying a continuous sequence of less than a defined minimum number of 
nucleotides in a noncoding class having nucleotides in a coding class on both sides to a coding 
class of flanking nucleotides; and, 

m) reclassifying a coding segment comprising more than one class of nucleotides to a 
most common class in said segment. 

37. The method of claim 35, wherein step b) comprises: 

g) summing probabilities of positive strand states for each of said nucleotides to produce 
a sum of probabilities for positive states; 

h) summing probabilities of negative strand states for each of said nucleotides to produce 
a sum of probabilities for negative states; and, 

i) deciding one of 

I) coding is mixed or not detectable if a first function of said sum of probabilities 
for positive states and said sum of probabilities for negative states is less than a threshold 
value; 

II) coding is on said positive strand if a second function of said sum of 
probabilities for positive states is greater than a third function of said sum of probabilities 
for negative states and said first function is not less than said threshold value; and 

III) coding is on said negative strand if said second function of said sum of 
probabilities for positive states is not greater than said third function of said sum of 
probabilities for negative states and said first function is not less than said threshold 
value. 

38. The method of claim 35, wherein step c) comprises: 

g) determining the points within said nucleic acid sequence in said coding strand at which 
a sum of the probabilities of coding states for each nucleotide drops below a first threshold value 
for a number of nucleotides greater than a second threshold value, wherein ends of an open 
reading frame are indicated at said points. 

39. The method of claim 35, wherein step a) comprises: 

g) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in a window of a first nucleotide; 

h) determining transition probabilities for each of said states for nucleotides within said 
window following said initial oligonucleotide; 

i) determining a probability for said window for each of said states; 

j) determining a probability for each of said states for said nucleotide based upon said 
probability for said window and a bias; and, 

k) repeating steps g) through j) for each remaining nucleotide in said nucleic acid 
sequence. 
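
Steps g) through k) above may be illustrated, without limitation, by the following Python sketch. It makes several simplifying assumptions that are not taken from the claims: the initial oligonucleotide is a single nucleotide, each state is modeled as a first-order Markov chain, the window is centered on the nucleotide, and the window probability and the bias are combined by multiplication followed by normalization across states. All names (make_model, MODELS, BIAS, state_posteriors, per_nucleotide_posteriors) and all numeric values are hypothetical.

```python
import math

NUCS = "acgt"


def make_model(freqs):
    """Toy state model whose transition probabilities ignore the previous base."""
    return {"initial": dict(freqs),
            "transition": {prev: dict(freqs) for prev in NUCS}}


# Hypothetical two-state example; real models would be trained on known genes.
MODELS = {
    "coding": make_model({"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}),
    "noncoding": make_model({"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}),
}
BIAS = {"coding": 0.5, "noncoding": 0.5}


def state_posteriors(window, models, bias):
    """Probability of each state given one window (steps g through j)."""
    log_scores = {}
    for state, model in models.items():
        # g) initial oligonucleotide probability
        logp = math.log(model["initial"][window[0]])
        # h) transition probabilities for the following nucleotides
        for prev, cur in zip(window, window[1:]):
            logp += math.log(model["transition"][prev][cur])
        # i) window probability, then j) combination with the bias
        log_scores[state] = logp + math.log(bias[state])
    peak = max(log_scores.values())  # subtract the peak to avoid underflow
    weights = {s: math.exp(v - peak) for s, v in log_scores.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def per_nucleotide_posteriors(seq, models, bias, half_width=2):
    """k) repeat for every nucleotide, using a window centered on it."""
    result = []
    for i in range(len(seq)):
        lo = max(0, i - half_width)
        hi = min(len(seq), i + half_width + 1)
        result.append(state_posteriors(seq[lo:hi], models, bias))
    return result
```

The sketch does not handle ambiguous nucleotides such as n; a practical implementation would need a rule for them.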

40. The method of claim 35, further comprising: 

g) translating said exons to determine a protein sequence. 
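
For illustration only, translation of exons into a protein sequence may be sketched in Python as below. The sketch uses the standard genetic code, one-letter amino acid symbols, "*" for a stop codon, and "X" for any codon containing an ambiguous nucleotide; the exon coordinates are assumed to be 0-based, half-open intervals on the coding strand. The names translate and translate_exons are hypothetical.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, codons enumerated in t, c, a, g order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a coding sequence, ignoring any trailing partial codon."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))


def translate_exons(seq: str, exons) -> str:
    """Join the exon intervals (start, end) and translate the result."""
    coding = "".join(seq[start:end] for start, end in exons)
    return translate(coding)
```

For example, translate("atggagaagatttggcatcac") yields "MEKIWHH", which corresponds to Met Glu Lys Ile Trp His His.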

41. A method for determining a probability for one or more states for a nucleotide in a nucleic 
acid sequence, comprising determining a probability for each of said states for said nucleotide 
based upon a probability of said nucleic acid sequence and a bias. 

42. A method for determining a probability for each of one or more states for more than one 
nucleotide in a nucleic acid sequence comprising: 

a) determining a probability for each of said states for a first nucleotide in said nucleic 
acid sequence based upon a probability of a window in which said first nucleotide is located and 
a bias; and, 

b) repeating step a) for the remaining nucleotides in said nucleic acid sequence. 

43. A program storage device readable by a machine, tangibly embodying a program of 
instructions executable by a machine to perform method steps to determine a probability for each 
of one or more states for a nucleotide in a nucleic acid sequence, said method steps comprising: 

a) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in said nucleic acid sequence; 

b) determining transition probabilities for each of said states for nucleotides within said 
nucleic acid sequence following said initial oligonucleotide; 

c) determining a probability for said nucleic acid sequence for each of said states; and, 

d) determining a probability for each of said states for said nucleotide based upon said 
probability of said nucleic acid sequence and a bias. 

44. A program storage device readable by a machine, tangibly embodying a program of 
instructions executable by a machine to perform method steps to determine a probability for one 
or more states for more than one nucleotide in a nucleic acid sequence, said method steps 
comprising: 

a) determining an initial oligonucleotide probability for each of said states for an initial 
oligonucleotide in a window of a first nucleotide; 

b) determining transition probabilities for each of said states for nucleotides within said 
window following said initial oligonucleotide; 

c) determining a probability for said window for each of said states; 

d) determining a probability for each of said states for said nucleotide based upon said 
probability for said window and a bias; and, 

e) repeating steps a) through d) for each remaining nucleotide in said nucleic acid 
sequence. 

Abstract 

The present invention is in the field of bioinformatics, particularly as it pertains to gene 
prediction. More specifically, the invention relates to the probabilistic analysis of nucleic acid 
sequences for the determination of coding features, including determination of state probabilities 
for each nucleotide in a nucleic acid sequence, determination of coding strand, determination of 
open reading frame extent, determination of insertion and deletion location, determination of 
exon location, and determination of protein sequence. 

Determine probability in each state of initial oligonucleotide in nucleic acid 
sequence. 

Determine transition probabilities in nucleic acid sequence for each state. 

14 

Determine the probability of the nucleic acid sequence in each state. 

Determine the probability of each state from the probability of the nucleic acid 
sequence given that state and a bias for that state. 

Figure 1 

Determine probability in each state of initial oligonucleotide in window for a 
nucleotide 

Determine transition probabilities in window for each state 

Determine the probability of the window in each state 

Determine probability of each state from the probability of the window given each 
state and a bias 

Next Nucleotide. 

END 32 

Figure 2 



Determine state probabilities for each nucleotide in the nucleic acid sequence for 
which the coding strand is being determined in each state. 

Determine the sum of the probabilities for the positive states for the nucleotides 
in the sequence for which coding is being determined, and set to X. 

Determine the sum of the probabilities for the negative states for the nucleotides 
in the sequence for which coding is being determined, and set to Y. 

Yes: Coding is mixed (both strands) or not detectable 

No, then Yes: Coding is on + strand 

No, then No: Coding is on - strand 

48 

Figure 3 



Determine state probabilities for each nucleotide in the nucleic acid sequence in 
which ORF is searched for 

Determine the coding strand (S) of the nucleic acid sequence 

Set L=length of sequence (S)/2 60 

If the sum of probabilities of the coding states for nucleotide L in strand S >= T 
and L>1, set L=L-1 and set P=L-1 

If the sum of probabilities of the coding states for nucleotide L in strand S < T' 
and P>1, set P=P-1 

If the sum of probabilities of the coding states for nucleotide M in strand S >= T' 
and M<Length of the sequence, set M=M+1 and set Q=M+1 

70 

Figure 4 



Determine state probabilities for each nucleotide in the nucleic acid sequence. 

78 

Set Z= first nucleotide, set W ^ gQ 

Determine the state with maximum mean probability over the interval Z to Z+(W/2) 
and set to A. 

Determine the state with the maximum mean probability over the interval Z+(W/2) 
to Z+W and set to B 

Determine the average hypothetical state probability for state A over Z to Z+W 
with insertion at Z+(W/2), set to N. 

Z=Z+1 

NO 

Determine the average hypothetical state probability for state A over Z to Z+W 
with deletion at Z+(W/2), set to M. 

If N>M then indicate deletion, and if M>N then indicate insertion. 

Figure 5 



Determine state probabilities for each nucleotide in the nucleic acid sequence. 

Determine strand and extent of the ORF 

Classify nucleotides as frame 1, 2, 3, or noncoding according to whichever state 
is most probable. 106 

Apply filters. 

Determine that regions with classes of 1, 2, or 3 are exons and translate to 
determine protein sequence. 

Figure 6 

Determine state probabilities for each nucleotide in the sequence. 102 

Determine extent and strand of ORF 

Classify nucleotides as frame 1, 2, 3, or noncoding according to whichever state 
is most probable. 

Where a noncoding nucleotide is flanked by nucleotides of the same class 
(e.g., 1N1), reclassify the noncoding nucleotide as the flanking class 

Where a nucleotide is flanked by at least two nucleotides of a different class 
(e.g., 11211), reclassify it as the flanking class 

Where two nucleotides are flanked by at least two nucleotides of a different class 
(e.g., 112211), reclassify the pair as the flanking class 

Where two nucleotides are flanked by nucleotides of a different class (e.g., 1221), 
reclassify the two nucleotides as the flanking class 118 

Where a nucleotide is flanked by nucleotides of a different class (e.g., 121), 
reclassify the nucleotide as the flanking class. 

120 

Where a noncoding class sequence segment of insufficient length is flanked by 
coding classes, reclassify that segment by extending both flanking regions into it 

Where a coding segment is composed of two or three classes, reclassify entire 
segment as the majority class 

Determine that regions with classes of 1, 2, or 3 are exons and translate to 
determine protein sequence 

Figure 7 
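
Two of the reclassification filters shown for Figure 7 may be sketched in Python as follows, by way of illustration only. The sketch represents each nucleotide class as a single character ("1", "2", "3" for the reading frames and "N" for noncoding) and applies the single-nucleotide rule in one left-to-right pass; both choices, and the names fill_singletons and majority_per_coding_segment, are assumptions made for this example.

```python
from collections import Counter
from itertools import groupby


def fill_singletons(classes):
    """Reclassify a lone nucleotide flanked on both sides by one class.

    Covers patterns such as 121 -> 111 and 1N1 -> 111.
    """
    out = list(classes)
    for i in range(1, len(out) - 1):
        if out[i - 1] == out[i + 1] != out[i]:
            out[i] = out[i - 1]
    return out


def majority_per_coding_segment(classes, noncoding="N"):
    """Reclassify each contiguous coding segment as its most common class."""
    out = []
    for is_coding, run in groupby(classes, key=lambda c: c != noncoding):
        run = list(run)
        if is_coding:
            top = Counter(run).most_common(1)[0][0]
            run = [top] * len(run)
        out.extend(run)
    return out
```

In this sketch a tie between classes within a segment is resolved in favor of the class encountered first, which is a property of Counter.most_common and not a rule taken from the figure.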

Figure 8b 

Figure 8d 

Figure 9b 

<120> COMPUTATIONAL NUCLEIC ACID CODING AND FEATURE ANALYSIS 
<130> 04983.0220.00US00 
<160> 4 
<170> PatentIn version 3.0 

<210> 1 
<211> 2165 
<212> DNA 
<213> Arabidopsis thaliana 
<220> 
<221> unsure 
<222> (1)...(2165) 
<223> Unsure at all n locations 
<220> 
<223> Ecotype Landsberg, genomic DNA 
<400> 1 

tactcaaaaa tatattccat gcttaattag gccggattcg cggtgacgat gcaccaagag 60 

cggtttttcc gagcattgta ggccgtcctc gccacaccgg tgtgatggtt gggatgggac 120 

aaaaggatgc ttatgttgga gacgaggctc aatcaaaacg tggtatcttg actctgaagt 180 

acccaattga gcatggaatt gttaataatt gggatgacat ggagaagatt tggcatcaca 240 

ctttctacaa tgagcttcgt gttgcccctg aagaacatcc ggttctcttg accgaagctc 300 

ctctcaatcc gaaagctaac cgtgagaaga tgactcagat catgtttgag acattcaata 360 

ctcctgctat gtatgttgcc attcaagctg ttctctcact ctatgccagt ggccgtacta 420 

ctggtcagta cattactaca ttctttttat accgtttggt tgaaataaaa ttcggtttgg 480 

ttcgattcga gtttgctctc attattttta ttttgttggt taggtattgt tttggactcc 540 

ggagatggtg tgagccacac ggtaccaatc tacgagggtt atgcacttcc acacgcaatc 600 

ctgcgtcttg atcttgcagg tcgtgaccta accgaccacc ttatgaaaat cctgacagag 660 

cgtggttact ctttcaccac aactgctgag cgtgagattg ttagagacat gaaggagaag 720 

ctctcttaca ttgccttgga ctttgaacaa gagctcgaga cttccaaaac aagctcatcc 780 

gttgagaaga gcttcgagct gccagacggt caagtgatca ccatcggggc agagcgtttc 840 

cgatgccctg aagttctgtt tcagccatcg atgatcggaa tggaaaatcc gggaattcat 900 

gaaactactt acaactcaat catgaaatgt gatgtggata tcaggaagga tctttatgga 960 



aacattgtgc ttagtggtgg caccacaatg ttcgatggga ttggtgatag gatgagtaaa 1020 

gagatcacag cgttggctcc aagcagtatg aacatcaaag tggtggctcc accggaaagg 1080 

aagtacagtg tctggatcgg tggctctatc ttggcttccc tcagtacttt ccagcaggta 1140 

aattacttac tatacttaat acataaagtc tattagtgat ttgatgtata aagtgttaca 1200 

aaaatgtgtt ccaaatttgc agatgtggat tgcgaaagcg gagtatgatg aatctggacc 1260 

gtcaatcgtc cacaggaagt gcttctgatc aaaagtcacc aagtaaaaca agagcggtaa 1320 

aaattttgat atcagttttt caccctgaag ccagttgcta taattactca caacttctct 1380 

atttgtgttc ttttattctt gtccctcgtt gttcatttta atctcttttt tgcaacaaag 1440 

caacttaaaa aaacagagca gtcattaaca gaatgttatt attatatata tgtatacata 1500 

ttagtataca cccattattt cattaaaaca tttatcatat aaggatagga ttctatacat 1560 

cgatatattt attttgttga cactattcag cacatgctta tgtcttatct tgttagtata 1620 

tgtaaccaaa gacaaataat agatgctaca aattgttttc tttgaagcaa aaatttcaat 1680 

cttaaaattg tttttttcca ggttacacaa aaaaaacttg tagtttgtaa attttctata 1740 

caattttggg gatctcaaca agaacatgaa cttcaacttc tagtcatatg acgacctgag 1800 

tctgcgcggc tgtgaatctc tttgctgcag taaatgttta caagtggtgt gtaaattggt 1860 

actgattcaa aagctttaag aaatctacac atttcgtgaa attatttagc agacttgata 1920 

ttaaaaatct aggataaaat gactatccaa agacaaatag gactgtttca catgttcccc 1980 

tgattcttgt agctcataac tcatcagcag ttaacttttc tacctcatac acgctcgcaa 2040 

tncgtttgga attatcagct ntaatttttc taattctttg gaaattatta gcagctcgat 2100 

caaatggggc atggcttctt cttctatctg caactcatct aaactttcca tgaagaaaca 2160 

aagct 2165 

<210> 2 
<211> 423 
<212> PRT 
<213> Unknown 

<220> 

<223> Describes a predicted protein sequence 
<220> 
<221> site 
<222> (1)...(423) 
<223> A Stop codon is predicted at all XAA locations 
<400> 2 

Xaa Arg Phe Phe Arg Ala Leu Xaa Ala Val Leu Ala Thr Pro Val Xaa 
1 5 10 15 

Trp Leu Gly Trp Asp Lys Arg Met Leu Met Leu Glu Thr Arg Leu Asn 
20 25 30 

Gln Asn Val Val Ser Xaa Leu Xaa Ser Thr Gln Leu Ser Met Glu Leu 
35 40 45 

Leu Ile Ile Gly Met Thr Trp Arg Arg Phe Gly Ile Thr Leu Ser Thr 
50 55 60 

Met Ser Phe Val Leu Pro Leu Lys Asn Ile Arg Xaa Leu Thr Glu Ala 
65 70 75 80 

Pro Leu Asn Pro Lys Ala Asn Arg Glu Lys Met Thr Gln Ile Met Phe 
85 90 95 

Glu Thr Phe Asn Thr Pro Ala Met Tyr Val Ala Ile Gln Ala Val Leu 
100 105 110 

Ser Leu Tyr Ala Ser Gly Arg Thr Thr Gly Gln Tyr Ile Thr Thr Phe 
115 120 125 

Phe Leu Tyr Arg Xaa Ser Gly Asp Gly Val Ser His Thr Val Pro Ile 
130 135 140 

Tyr Glu Gly Tyr Ala Leu Pro His Ala Ile Leu Arg Leu Asp Leu Ala 
145 150 155 160 

Gly Arg Asp Leu Thr Asp His Leu Met Lys Ile Leu Thr Glu Arg Gly 
165 170 175 

Tyr Ser Phe Thr Thr Thr Ala Glu Arg Glu Ile Val Arg Asp Met Lys 
180 185 190 

Glu Lys Leu Ser Tyr Ile Ala Leu Asp Phe Glu Gln Glu Leu Glu Thr 
195 200 205 

Ser Lys Thr Ser Ser Ser Val Glu Lys Ser Phe Glu Leu Pro Asp Gly 
210 215 220 

Gln Val Ile Thr Ile Gly Ala Glu Arg Phe Arg Cys Pro Glu Val Leu 
225 230 235 240 

Phe Gln Pro Ser Met Ile Gly Met Glu Asn Pro Gly Ile His Glu Thr 
245 250 255 

Thr Tyr Asn Ser Ile Met Lys Cys Asp Val Asp Ile Arg Lys Asp Leu 
260 265 270 

Tyr Gly Asn Ile Val Leu Ser Gly Gly Thr Thr Met Phe Asp Gly Ile 
275 280 285 

Gly Asp Arg Met Ser Lys Glu Ile Thr Ala Leu Ala Pro Ser Ser Met 
290 295 300 

Lys Ile Lys Val Val Ala Pro Pro Glu Arg Lys Tyr Ser Val Trp Ile 
305 310 315 320 

Gly Gly Ser Ile Xaa Val Pro Asn Leu Gln Met Trp Ile Ala Lys Ala 
325 330 335 

Glu Tyr Xaa Asn Leu Asp Arg Gln Ser Ser Thr Gly Ser Ala Ser Asp 
340 345 350 

Gln Lys Ser Pro Ser Lys Thr Arg Ala Val Lys Ile Leu Xaa Asn Ser 
355 360 365 

Ser Ala Val Asn Phe Ser Thr Ser Tyr Thr Leu Ala Ile Arg Leu Glu 
370 375 380 

Leu Ser Ala Leu Ile Phe Leu Ile Ser Leu Glu Ile Ile Ser Ser Ser 
385 390 395 400 

Ile Lys Trp Gly Met Ala Ser Ser Ser Ile Cys Asn Ser Ser Lys Leu 
405 410 415 

Ser Met Lys Lys Gln Ser Xaa 
420 

<210> 3 
<211> 422 
<212> PRT 
<213> Unknown 

<220> 
<223> Describes a predicted protein sequence 
<220> 
<221> site 
<222> (1)...(422) 
<223> A stop codon is predicted at all XAA locations 
<400> 3 

Xaa Arg Phe Phe Arg Ala Leu Xaa Ala Val Leu Ala Thr Pro Val Xaa 
1 5 10 15 

Trp Leu Gly Trp Asp Lys Arg Met Leu Met Leu Glu Thr Arg Leu Asn 
20 25 30 

Gln Asn Val Val Ser Xaa Leu Xaa Ser Thr Gln Leu Ser Met Glu Leu 
35 40 45 

Leu Ile Ile Gly Met Thr Trp Arg Arg Phe Gly Ile Thr Leu Ser Thr 
50 55 60 

Met Ser Phe Val Leu Pro Leu Lys Asn Ile Arg Xaa Leu Thr Glu Ala 
65 70 75 80 

Pro Leu Asn Pro Lys Ala Asn Arg Glu Lys Met Thr Gln Ile Met Phe 
85 90 95 

Glu Thr Phe Asn Thr Pro Ala Met Tyr Val Ala Ile Gln Ala Val Leu 
100 105 110 

Ser Leu Tyr Ala Ser Gly Arg Thr Thr Gly Gln Tyr Ile Thr Thr Phe 
115 120 125 

Phe Leu Tyr Arg Xaa Ser Gly Asp Gly Val Ser His Thr Val Pro Ile 
130 135 140 

Tyr Glu Gly Tyr Ala Leu Pro His Ala Ile Leu Arg Leu Asp Leu Ala 
145 150 155 160 

Gly Arg Asp Leu Thr Asp His Leu Met Lys Ile Leu Thr Glu Arg Gly 
165 170 175 

Tyr Ser Phe Thr Thr Thr Ala Glu Arg Glu Ile Val Arg Asp Met Lys 
180 185 190 

Glu Lys Leu Ser Tyr Ile Ala Leu Asp Phe Glu Gln Glu Leu Glu Thr 
195 200 205 

Ser Lys Thr Ser Ser Ser Val Glu Lys Ser Phe Glu Leu Pro Asp Gly 
210 215 220 

Gln Val Ile Thr Ile Gly Ala Glu Arg Phe Arg Cys Pro Glu Val Leu 
225 230 235 240 

Phe Gln Pro Ser Met Ile Gly Met Glu Asn Pro Gly Ile His Glu Thr 
245 250 255 

Thr Tyr Asn Ser Ile Met Lys Cys Asp Val Asp Ile Arg Lys Asp Leu 
260 265 270 

Tyr Gly Asn Ile Val Leu Ser Gly Gly Thr Thr Met Phe Asp Gly Ile 
275 280 285 

Gly Asp Arg Met Ser Lys Glu Ile Thr Ala Leu Ala Pro Ser Ser Met 
290 295 300 

Lys Ile Lys Val Val Ala Pro Pro Glu Arg Lys Tyr Ser Val Trp Ile 
305 310 315 320 

Gly Gly Ser Ile Leu Ala Ser Xaa Gln Met Trp Ile Ala Lys Ala Glu 
325 330 335 

Tyr Xaa Asn Leu Asp Arg Gln Ser Ser Thr Gly Ser Ala Ser Asp Gln 
340 345 350 

Lys Ser Pro Ser Lys Thr Arg Ala Val Lys Ile Leu Xaa Asn Ser Ser 
355 360 365 

Ala Val Asn Phe Ser Thr Ser Tyr Thr Leu Ala Ile Arg Leu Glu Leu 
370 375 380 

Ser Ala Leu Ile Phe Leu Ile Ser Leu Glu Ile Ile Ser Ser Ser Ile 
385 390 395 400 

Lys Trp Gly Met Ala Ser Ser Ser Ile Cys Asn Ser Ser Lys Leu Ser 
405 410 415 

Met Lys Lys Gln Ser Xaa 
420 

<210> 4 
<211> 296 
<212> PRT 

<213> Arabidopsis thaliana 
<220> 

<223> Ecotype Columbia, describes actin 
<400> 4 

Met Glu Lys Ile Trp His His Thr Phe Tyr Asn Glu Leu Arg Val Ala 
1 5 10 15 

Pro Glu Glu His Pro Val Leu Leu Thr Glu Ala Pro Leu Asn Pro Lys 
20 25 30 

Ala Asn Arg Glu Lys Met Thr Gln Ile Met Phe Glu Thr Phe Asn Thr 
35 40 45 

Pro Ala Met Tyr Val Ala Ile Gln Ala Val Leu Ser Leu Ala Ser Gly 
50 55 60 

Arg Thr Thr Gly Gly Ile Val Leu Asp Ser Gly Asp Gly Val Ser His 
65 70 75 80 

Thr Val Pro Ile Tyr Glu Gly Tyr Ala Leu Pro His Ala Ile Leu Arg 
85 90 95 

Leu Asp Leu Ala Gly Arg Asp Leu Thr Asp His Leu Met Lys Ile Leu 
100 105 110 

Thr Glu Arg Gly Tyr Ser Phe Thr Thr Thr Ala Glu Arg Glu Ile Val 
115 120 125 

Arg Asp Met Lys Glu Lys Leu Ser Tyr Ile Ala Leu Asp Phe Glu Gln 
130 135 140 

Glu Leu Glu Thr Ser Lys Thr Ser Ser Ser Val Glu Lys Ser Phe Glu 
145 150 155 160 

Leu Pro Asp Gly Gln Val Ile Thr Ile Gly Ala Glu Arg Phe Arg Cys 
165 170 175 

Pro Glu Val Leu Phe Gln Pro Ser Met Ile Gly Met Glu Asn Pro Gly 
180 185 190 

Ile His Glu Thr Thr Tyr Asn Ser Ile Met Lys Cys Asp Val Asp Ile 
195 200 205 

Arg Lys Asp Leu Tyr Gly Asn Ile Val Leu Ser Gly Gly Thr Thr Met 
210 215 220 

Phe Gly Gly Ile Gly Asp Arg Met Ser Lys Glu Ile Thr Ala Leu Ala 
225 230 235 240 

Pro Ser Ser Met Lys Ile Lys Val Val Ala Pro Pro Glu Arg Lys Tyr 
245 250 255 

Ser Val Trp Ile Gly Gly Ser Ile Leu Ala Ser Leu Ser Thr Phe Gln 
260 265 270 

Gln Met Gln Met Trp Ile Ala Lys Ala Glu Tyr Asp Glu Ser Gly Pro 
275 280 285 

Ser Ile Val His Arg Lys Cys Phe 
290 295 